UNIFORM POST SELECTION INFERENCE FOR LAD REGRESSION MODELS



A. BELLONI, V. CHERNOZHUKOV, AND K. KATO 



Abstract. We develop uniformly valid confidence regions for a regression coefficient in a high-dimensional sparse LAD (least absolute deviation or median) regression model. The setting is one where the number of regressors $p$ could be large in comparison to the sample size $n$, but only $s \ll n$ of them are needed to accurately describe the regression function. Our new methods are based on the instrumental LAD regression estimator that assembles the optimal estimating equation from either post $\ell_1$-penalized LAD regression or $\ell_1$-penalized LAD regression. The estimating equation is immunized against non-regular estimation of the nuisance part of the regression function, in the sense of Neyman. We establish that in a homoscedastic regression model, under certain conditions, the instrumental LAD regression estimator of the regression coefficient is asymptotically root-$n$ normal uniformly with respect to the underlying sparse model. The resulting confidence regions are valid uniformly with respect to the underlying model. The new inference methods outperform the naive, "oracle based" inference methods, which are known not to be uniformly valid (their coverage property fails to hold uniformly with respect to the underlying model), even in the setting with $p = 2$. We also provide Monte-Carlo experiments which demonstrate that standard post-selection inference breaks down over large parts of the parameter space, and the proposed method does not.
1. Introduction

We consider the following regression model

$$y_i = d_i\alpha_0 + x_i'\beta_0 + \epsilon_i, \quad i = 1,\ldots,n, \tag{1.1}$$

where $d_i$ is the "main regressor" of interest, whose coefficient $\alpha_0$ we would like to estimate and perform (robust) inference on. The $(x_i)_{i=1}^n$ are other high-dimensional regressors or "controls" and are treated as fixed (the $d_i$'s are random). The regression error $\epsilon_i$ is independent of $d_i$ and has median 0. The errors $(\epsilon_i)_{i=1}^n$ are i.i.d. with distribution function $F(\cdot)$ and probability density function $f_\epsilon(\cdot)$ such that $F(0) = 1/2$ and $f_\epsilon = f_\epsilon(0) > 0$. The assumption on the error term motivates the use of the least absolute deviation (LAD) or median regression, suitably adjusted for use in high-dimensional settings.
of Michigan (October 2012). We are grateful to Sara van de Geer, Xuming He, Richard Nickl, Roger Koenker, Vladimir Koltchinskii, Steve Portnoy, Philippe Rigollet, and Bin Yu for useful comments and discussions.

The dimension $p$ of the "controls" $x_i$ is large, potentially much larger than $n$, which creates a challenge for inference on $\alpha_0$. Although the unknown true parameter $\beta_0$ lies in this large space, the key assumption that will make estimation possible is its sparsity, namely $T = \mathrm{support}(\beta_0)$ has $s < n$ elements (where $s$ can depend on $n$; we shall use array asymptotics). This in turn motivates the use of regularization or model selection methods.

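A standard (non-robust) approach towards inference in this setting would be first to perform model selection via the $\ell_1$-penalized LAD regression estimator

$$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha,\beta} E_n[|y_i - d_i\alpha - x_i'\beta|] + \frac{\lambda}{n}\|(\alpha, \beta')'\|_1, \tag{1.2}$$

and then to use the post-model selection estimator

$$(\tilde\alpha, \tilde\beta) \in \arg\min_{\alpha,\beta} \left\{ E_n[|y_i - d_i\alpha - x_i'\beta|] : \beta_j = 0 \text{ if } \hat\beta_j = 0 \right\} \tag{1.3}$$

to perform "usual" inference for $\alpha_0$. (The notation $E_n[\cdot]$ denotes the average over index $1 \le i \le n$.)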



This standard approach is justified if (1.2) achieves perfect model selection with probability approaching 1, so that the estimator (1.3) has the "oracle" property with probability approaching 1. However, conditions for "perfect selection" are very restrictive in this model, in particular requiring significant separation of non-zero coefficients away from zero. If these conditions do not hold, the estimator $\tilde\alpha$ does not converge to $\alpha_0$ at the $1/\sqrt{n}$ rate uniformly with respect to the underlying model, which implies that "usual" inference breaks down and is not valid. (The statements continue to apply if $\alpha$ is not penalized in (1.2), if $\alpha$ is restricted in (1.3), or if thresholding is applied.) We shall demonstrate the breakdown of such naive inference in the Monte-Carlo experiments where non-zero coefficients in $\theta_0$ are not significantly separated from zero.

Note that the breakdown of inference does not mean that the aforementioned procedures are not suitable for prediction purposes. Indeed, the $\ell_1$-LAD estimator (1.2) and post $\ell_1$-LAD estimator (1.3) attain (essentially) optimal rates $\sqrt{(s \log p)/n}$ of convergence for estimating the entire median regression function, as has been shown in [211 121 (131 123 and in [3]. This property means that while these procedures will not deliver perfect model recovery, they will only make "moderate" model selection mistakes (omitting only controls with coefficients local to zero).

To achieve uniformly valid inferential performance we propose a procedure whose performance does not require perfect model selection and allows potential "moderate" model selection mistakes. The latter feature is critical in achieving uniformity over a large class of data generating processes, similarly to the results for instrumental regression and mean regression studied in [27], [2], [7], [6]. This allows us to overcome the impact of (moderate) model selection mistakes on inference, avoiding (in part) the criticisms in [17], who prove that the "oracle property" sometimes achieved by the naive estimators necessarily implies the failure of uniform validity of inference and their semiparametric inefficiency [18].

In order to achieve robustness with respect to moderate model selection mistakes, it will be necessary 
to achieve the proper orthogonality condition between the main regressors and the control variables. 




Towards that goal the following auxiliary equation plays a key role (in the homoscedastic case):

$$d_i = x_i'\theta_0 + v_i, \quad E[v_i] = 0, \quad i = 1,\ldots,n, \tag{1.4}$$

describing the relevant dependence of the regressor of interest $d_i$ on the other controls $x_i$. We shall assume the sparsity of $\theta_0$, namely $T_d = \mathrm{support}(\theta_0)$ has at most $s < n$ elements, and estimate the relation (1.4) via Lasso or post-Lasso methods described below.

Given $v_i$, which "partials out" the effect of $x_i$ from $d_i$, we shall use it as an instrument in the following estimating equations for $\alpha_0$:

$$E[\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)\, v_i] = 0, \quad i = 1,\ldots,n,$$

where $\varphi(t) = 1/2 - 1\{t \le 0\}$. We shall use the empirical analog of this equation to form an instrumental LAD regression estimator of $\alpha_0$, using a plug-in estimator for $x_i'\beta_0$. The estimating equation above has the following feature:

$$\frac{\partial}{\partial\beta} E[\varphi(y_i - d_i\alpha_0 - x_i'\beta)\, v_i]\Big|_{\beta=\beta_0} = 0, \quad i = 1,\ldots,n. \tag{1.5}$$

As a result, the estimator of $\alpha_0$ will be "immunized" against "crude" estimation of $x_i'\beta_0$, for example, via a post-selection procedure or some regularization procedure. As we explain in Section 5, such immunization ideas can be traced back to Neyman ([19], [20]).
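
The orthogonality property (1.5) can be checked directly in the homoscedastic model. The following is a short sketch rather than part of the formal proofs; it uses only the assumptions stated above (fixed $x_i$, $\epsilon_i$ independent of $d_i$ and hence of $v_i$, and $E[v_i] = 0$). The first equality substitutes (1.1), the second uses independence together with $E[1\{\epsilon_i \le t\}] = F(t)$.

```latex
\begin{align*}
E[\varphi(y_i - d_i\alpha_0 - x_i'\beta)\, v_i]
  &= E[\varphi(\epsilon_i - x_i'(\beta - \beta_0))\, v_i] \\
  &= \{1/2 - F(x_i'(\beta - \beta_0))\}\, E[v_i], \\
\frac{\partial}{\partial\beta}
  E[\varphi(y_i - d_i\alpha_0 - x_i'\beta)\, v_i]\Big|_{\beta = \beta_0}
  &= -f_\epsilon(0)\, x_i\, E[v_i] = 0 .
\end{align*}
```

In words, a first-order error in the plug-in for $x_i'\beta_0$ has no first-order effect on the moment condition because the instrument $v_i$ is mean zero and independent of the error.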

Our estimation procedure has the following three steps. 

Step 1: Estimation of the confounding function $x_i'\beta_0$ in (1.1).
Step 2: Estimation of the instruments (residuals) $v_i$ in (1.4).
Step 3: Estimation of the main effect $\alpha_0$ based on the instrumental
LAD regression using $v_i$ as instruments for $d_i$.

Each step is computationally tractable, involving solutions of convex problems and a one-dimensional 
search, and relies on a different identification condition which in turn requires a different estimation 
procedure: 

Step 1 constructs an estimate for the nuisance function $x_i'\beta_0$ and not an estimate for $\alpha_0$. Here we do not need $\sqrt{n}$-rate consistency for the estimates of the nuisance function; a slower rate like $o(n^{-1/4})$ will suffice. Thus, this can be based either on the $\ell_1$-LAD regression estimator (1.2) or the associated post-model selection estimator (1.3).



Step 2 partials out the impact of the covariates $x_i$ on the main regressor $d_i$, obtaining the estimate of the residuals $v_i$ in the decomposition (1.4). In order to estimate these residuals we rely either on heteroscedastic Lasso [2], a version of the Lasso estimator of [23l [9]:

$$\hat\theta \in \arg\min_{\theta} E_n[(d_i - x_i'\theta)^2] + \frac{\lambda}{n}\|\hat\Gamma\theta\|_1 \quad \text{and set } \hat v_i = d_i - x_i'\hat\theta, \quad i = 1,\ldots,n, \tag{1.6}$$

where $\lambda$ and $\hat\Gamma$ are the penalty level and data-driven penalty loadings described in [5] (restated in Appendix D), or the associated post-model selection estimator (Post-Lasso) [11 [2] defined as

$$\tilde\theta \in \arg\min_{\theta} \left\{ E_n[(d_i - x_i'\theta)^2] : \theta_j = 0 \text{ if } \hat\theta_j = 0 \right\} \quad \text{and set } \hat v_i = d_i - x_i'\tilde\theta. \tag{1.7}$$
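
As an illustration of (1.6) and (1.7), the following is a minimal pure-Python sketch of a coordinate-descent Lasso with penalty loadings, followed by a Post-Lasso refit. It is not the authors' implementation: the function names (`soft_threshold`, `lasso_cd`, `post_lasso`) are hypothetical, the penalty level `lam` and the loadings are supplied by the user rather than chosen by the data-driven rule of [5], and the refit simply reruns coordinate descent with zero penalty, which assumes the selected columns are not collinear.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, d, lam, loadings, n_iter=200):
    """Coordinate descent for E_n[(d_i - x_i'theta)^2] + (lam/n)*||Gamma theta||_1.

    X: list of n rows (lists of p floats); d: list of n floats;
    loadings: diagonal of Gamma (user-supplied here, an assumption).
    """
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    resid = list(d)  # d_i - x_i'theta for the current theta
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # E_n[x_ij * (residual with coordinate j added back)]
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * theta[j]
            new = soft_threshold(rho, lam * loadings[j] / (2.0 * n)) / col_ms[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                theta[j] = new
    return theta


def post_lasso(X, d, theta_hat):
    """Least-squares refit on the support of theta_hat, as in (1.7)."""
    support = [j for j, t in enumerate(theta_hat) if t != 0.0]
    theta = [0.0] * len(theta_hat)
    if support:
        X_sel = [[row[j] for j in support] for row in X]
        # lam = 0 makes coordinate descent an unpenalized least-squares solver
        refit = lasso_cd(X_sel, d, 0.0, [1.0] * len(support))
        for j, t in zip(support, refit):
            theta[j] = t
    residuals = [d[i] - sum(X[i][j] * theta[j] for j in range(len(theta)))
                 for i in range(len(d))]
    return theta, residuals


# Toy orthogonal design: d = 2*x1 + 0.1*x2
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
d = [2.1, 1.9, -1.9, -2.1]
theta_hat = lasso_cd(X, d, lam=4.0, loadings=[1.0, 1.0])
theta_tilde, v_hat = post_lasso(X, d, theta_hat)
```

In this toy design the Lasso shrinks the first coefficient and zeroes the second, while the refit removes the shrinkage on the selected coordinate; the small omitted coefficient ends up in the residuals, which is the kind of "moderate" selection mistake the procedure is designed to tolerate.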
Step 3 constructs an estimator $\check\alpha$ of the coefficient $\alpha_0$ via an instrumental LAD regression proposed in [10], using $(\hat v_i)_{i=1}^n$ as instruments. Formally, $\check\alpha$ is defined as

$$\check\alpha \in \arg\inf_{\alpha \in \mathcal{A}} L_n(\alpha), \quad \text{where } L_n(\alpha) = \frac{4\,|E_n[\varphi(y_i - x_i'\hat\beta - d_i\alpha)\,\hat v_i]|^2}{E_n[\hat v_i^2]}, \tag{1.8}$$

$\varphi(t) = 1/2 - 1\{t \le 0\}$ and $\mathcal{A}$ is a parameter space for $\alpha_0$. We will analyze the choice of $\mathcal{A} = [\hat\alpha - C\log^{-1} n, \hat\alpha + C\log^{-1} n]$ with a suitable constant $C > 0$.¹ Several other choices for $\mathcal{A}$ are possible.
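
The one-dimensional search in Step 3 can be sketched as follows. This is a minimal illustration and not the authors' code: the names `phi`, `criterion` and `instrumental_lad` are hypothetical, `y_adj` stands for the adjusted outcome $y_i - x_i'\hat\beta$ from Step 1, `v` for the residuals from Step 2, and the minimization is a plain grid search over an interval whose center and half-width are supplied by the user.

```python
def phi(t):
    """phi(t) = 1/2 - 1{t <= 0}."""
    return 0.5 - (1.0 if t <= 0 else 0.0)


def criterion(alpha, y_adj, d, v):
    """L_n(alpha) = 4 * |E_n[phi(y_adj_i - d_i*alpha) * v_i]|^2 / E_n[v_i^2]."""
    n = len(d)
    moment = sum(phi(y_adj[i] - d[i] * alpha) * v[i] for i in range(n)) / n
    v_ms = sum(vi * vi for vi in v) / n
    return 4.0 * moment ** 2 / v_ms


def instrumental_lad(y_adj, d, v, center, half_width, n_grid=2001):
    """Grid-search minimizer of L_n over [center - half_width, center + half_width]."""
    grid = [center - half_width + 2.0 * half_width * k / (n_grid - 1)
            for k in range(n_grid)]
    return min(grid, key=lambda a: criterion(a, y_adj, d, v))


# Toy data with a symmetric pattern of residuals around alpha = 0
y_adj = [1.0, -1.0, 2.0, -2.0]
d = [1.0, 1.0, 1.0, 1.0]
v = [1.0, 1.0, 1.0, 1.0]
alpha_check = instrumental_lad(y_adj, d, v, center=0.0, half_width=0.5, n_grid=11)
```

Because $L_n$ is a step function of $\alpha$, the minimizer is generally a set rather than a point; the grid search returns the first grid point attaining the minimum.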

Our main result establishes conditions under which $\check\alpha$ is root-$n$ consistent for $\alpha_0$, asymptotically normal, and achieves the semi-parametric efficiency bound for estimating $\alpha_0$ in the current homoscedastic setting, provided that $(s^3 \log^3 p)/n \to 0$ and other regularity conditions hold. Specifically, we show that, despite possible model selection mistakes in Steps 1 and 2, the estimator $\check\alpha$ obeys

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1), \tag{1.9}$$

where $\sigma_n^2 := 1/(4 f_\epsilon^2 \bar{E}[v_i^2])$ with $f_\epsilon = f_\epsilon(0)$. An alternative (and more robust) expression for $\sigma_n^2$ is given by Huber's sandwich:

$$\sigma_n^2 = J^{-1}\Omega J^{-1}, \quad \text{where } \Omega := \bar{E}[v_i^2]/4 \text{ and } J := \bar{E}[f_\epsilon d_i v_i]. \tag{1.10}$$

We recommend estimating $\Omega$ by the plug-in method and $J$ by Powell's method [21]. Furthermore, we show that the criterion function at the true value $\alpha_0$ in Step 3 has the following pivotal behavior

$$n L_n(\alpha_0) \rightsquigarrow \chi^2(1). \tag{1.11}$$

This allows the construction of a confidence region $\hat A_{n,\xi}$ with asymptotic coverage $1 - \xi$ based on the statistic $L_n$,

$$P(\alpha_0 \in \hat A_{n,\xi}) \to 1 - \xi, \quad \text{where } \hat A_{n,\xi} = \{\alpha \in \mathcal{A} : n L_n(\alpha) \le (1 - \xi)\text{-quantile of } \chi^2(1)\}. \tag{1.12}$$

Importantly, the robustness with respect to moderate model selection mistakes, which occurs because of (1.5), allows the results (1.9) and (1.11) to hold uniformly over a large range of data generating processes, similarly to the results for instrumental regression and partially linear mean regression model established in [6], [27], [2]. One of our proposed algorithms explicitly uses $\ell_1$-regularization methods, similarly to [27] and [2], while the main algorithm we propose uses post-selection methods, similarly to [Bl[2].
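
The confidence region in (1.12) is obtained by inverting the statistic $nL_n(\alpha)$ over the parameter space. A minimal sketch, assuming the criterion is available as a Python callable and the parameter space is discretized on a grid (both are illustrative assumptions, as are the function names), could look as follows. The $\chi^2(1)$ quantile is computed from the standard normal quantile, using the fact that the square of a standard normal variable is $\chi^2(1)$.

```python
from statistics import NormalDist


def chi2_1_quantile(level):
    """Quantile of chi-square(1) at `level`, via Z^2 ~ chi-square(1) for Z ~ N(0,1)."""
    return NormalDist().inv_cdf(0.5 + level / 2.0) ** 2


def confidence_region(grid, n, criterion, xi=0.05):
    """Grid points alpha with n * L_n(alpha) <= (1 - xi)-quantile of chi-square(1)."""
    cutoff = chi2_1_quantile(1.0 - xi)
    return [a for a in grid if n * criterion(a) <= cutoff]


# Toy criterion standing in for L_n (purely illustrative)
region = confidence_region([0.5, 0.9, 1.0, 1.1, 1.5], n=100,
                           criterion=lambda a: (a - 1.0) ** 2)
```

The region need not be an interval in general, since $L_n$ is not convex in $\alpha$; reporting the retained grid points avoids assuming otherwise.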

Throughout the paper, we use array asymptotics - asymptotics where the model changes with n - to 
better capture some finite-sample phenomena such as "small coefficients" that are local to zero. This 
ensures the robustness of conclusions with respect to perturbations of the data-generating process along 
various model sequences. This robustness, in turn, translates into uniform validity of confidence regions 
over substantial regions of data-generating processes. 



¹ For numerical experiments we used $C = 10\,(E_n[d_i^2])^{-1/2}$ (and typically we normalize $E_n[d_i^2] = 1$).
1.1. Notation and convention. Denote by $(\Omega, \mathcal{F}, P)$ the underlying probability space. The notation $E_n[\cdot]$ denotes the average over index $1 \le i \le n$, i.e., it simply abbreviates the notation $n^{-1}\sum_{i=1}^n[\cdot]$. For example, $E_n[x_{ij}^2] = n^{-1}\sum_{i=1}^n x_{ij}^2$. Moreover, we use the notation $\bar{E}[\cdot] = E_n[E[\cdot]]$. For example, $\bar{E}[v_i^2] = n^{-1}\sum_{i=1}^n E[v_i^2]$. For a function $f : \mathbb{R} \times \mathbb{R} \times \mathbb{R}^p \to \mathbb{R}$, we write $G_n(f) = n^{-1/2}\sum_{i=1}^n (f(y_i, d_i, x_i) - E[f(y_i, d_i, x_i)])$. The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. Denote by $\|\cdot\|_\infty$ the maximal absolute element of a vector. For a sequence $(z_i)_{i=1}^n$ of constants, we write $\|z_i\|_{2,n} = \sqrt{E_n[z_i^2]}$. For example, for a vector $\delta \in \mathbb{R}^p$, $\|x_i'\delta\|_{2,n} = \sqrt{E_n[(x_i'\delta)^2]}$ denotes the prediction norm of $\delta$. Given a vector $\delta \in \mathbb{R}^p$, and a set of indices $T \subset \{1,\ldots,p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector such that $(\delta_T)_j = \delta_j$ if $j \in T$ and $(\delta_T)_j = 0$ if $j \notin T$. Also we write the support of $\delta$ as $\mathrm{support}(\delta) = \{j \in \{1,\ldots,p\} : \delta_j \ne 0\}$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$, and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$. The arrow $\rightsquigarrow$ denotes convergence in distribution.

We assume that the quantities such as $p$ (the dimension of $x_i$), $s$ (a bound on the numbers of non-zero elements of $\beta_0$ and $\theta_0$), and hence $y_i$, $x_i$, $\beta_0$, $\theta_0$, $T$ and $T_d$ are all dependent on the sample size $n$, and allow for the case where $p = p_n \to \infty$ and $s = s_n \to \infty$ as $n \to \infty$. However, for notational convenience, we shall omit the dependence of these quantities on $n$.

2. The Methods, Conditions, and Results 

2.1. The methods. Each of the steps outlined before uses a different identification condition. Several 
combinations are possible to implement each step, two of which are the following. 

Algorithm 1 (Based on Post-Model Selection estimators). 

(1) Run Post-$\ell_1$-penalized LAD (1.3) of $y_i$ on $d_i$ and $x_i$; keep fitted value $x_i'\tilde\beta$.

(2) Run Post-Lasso (1.7) of $d_i$ on $x_i$; keep the residual $\hat v_i := d_i - x_i'\tilde\theta$.

(3) Run Instrumental LAD regression (1.8) of $y_i - x_i'\tilde\beta$ on $d_i$ using $\hat v_i$ as the instrument for $d_i$ to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference based upon (1.9) or (1.12).

Algorithm 2 (Based on Regularized Estimators).

(1) Run $\ell_1$-penalized LAD (1.2) of $y_i$ on $d_i$ and $x_i$; keep fitted value $x_i'\hat\beta$.

(2) Run Lasso (1.6) of $d_i$ on $x_i$; keep the residual $\hat v_i := d_i - x_i'\hat\theta$.

(3) Run Instrumental LAD regression (1.8) of $y_i - x_i'\hat\beta$ on $d_i$ using $\hat v_i$ as the instrument for $d_i$ to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference based upon (1.9) or (1.12).

Comment 2.1 (Penalty Levels). In order to perform $\ell_1$-LAD and Lasso, one has to suitably choose the penalty levels. In the Supplementary Appendix D we provide implementation details including penalty choices for each step of the algorithm, and in all that follows we shall obey the penalty choices described in Appendix D.

Comment 2.2 (Differences). Algorithm 1 relies on Post-$\ell_1$-LAD and Post-Lasso while Algorithm 2 relies on $\ell_1$-LAD and Lasso. Since Algorithm 1 refits the non-zero coefficients without the penalty term it has a smaller bias. It therefore relies on $\ell_1$-LAD and Lasso obtaining sparse solutions, which in turn typically relies on restricted isometry conditions [IJE]. Algorithm 2 relies on penalized estimators. Step 3 of both algorithms relies on instrumental LAD regression with estimated data.

Comment 2.3 (Alternative Implementations). As discussed before, the three-step approach proposed here can be implemented with several different methods, each with specific features. For instance, the Dantzig selector, square-root Lasso or the associated post-model selection estimators could be used instead of Lasso or Post-Lasso. Moreover, the instrumental LAD regression can be substituted by a 1-step estimator from the $\ell_1$-LAD estimator $\hat\alpha$ of the form $\check\alpha = \hat\alpha + (E_n[f_\epsilon \hat v_i^2])^{-1} E_n[\varphi(y_i - d_i\hat\alpha - x_i'\hat\beta)\hat v_i]$ or by a LAD regression with all the covariates selected in Steps 1 and 2.
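
The 1-step estimator mentioned in Comment 2.3 amounts to a single Newton-type update of the preliminary estimate. A minimal sketch under the assumption that the error density at zero, `f_eps`, is known or has been estimated separately (the function names are hypothetical; `y_adj` denotes $y_i - x_i'\hat\beta$ and `v` the estimated residuals) is:

```python
def phi(t):
    """phi(t) = 1/2 - 1{t <= 0}."""
    return 0.5 - (1.0 if t <= 0 else 0.0)


def one_step(alpha_hat, y_adj, d, v, f_eps):
    """alpha_hat + (E_n[f_eps * v_i^2])^{-1} * E_n[phi(y_adj_i - d_i*alpha_hat) * v_i]."""
    n = len(d)
    score = sum(phi(y_adj[i] - d[i] * alpha_hat) * v[i] for i in range(n)) / n
    slope = sum(f_eps * vi * vi for vi in v) / n
    return alpha_hat + score / slope


# Toy numbers: three positive and one negative adjusted residual at alpha_hat = 0
alpha_one_step = one_step(0.0, [1.0, 2.0, 3.0, -1.0], [1.0] * 4, [1.0] * 4, f_eps=0.5)
```

In the toy call the empirical moment equals 0.25 and the slope equals 0.5, so the update moves the estimate from 0 to 0.5.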

2.2. Regularity Conditions. Here we provide regularity conditions that are sufficient for validity of 
the main estimation and inference results. We begin by stating our main condition, which contains the 
previously defined approximate sparsity as well as other more technical assumptions. Throughout the paper, let $c$ and $C$ be positive constants independent of $n$, and let $\ell_n \nearrow \infty$, $\delta_n \searrow 0$, and $\Delta_n \searrow 0$ be sequences of positive constants. Let $K_x := \max_{1 \le i \le n}\|x_i\|_\infty$.

Condition I. (i) $(\epsilon_i)_{i=1}^n$ is a sequence of i.i.d. random variables with common distribution function $F$ such that $F(0) = 1/2$, $(v_i)_{i=1}^n$ is a sequence of independent mean-zero random variables independent of $(\epsilon_i)_{i=1}^n$, and $(x_i)_{i=1}^n$ is a sequence of non-stochastic vectors in $\mathbb{R}^p$ of covariates normalized in such a way that $E_n[x_{ij}^2] = 1$ for all $1 \le j \le p$. The sequence $\{(y_i, d_i)'\}_{i=1}^n$ of random vectors is generated according to models (1.1) and (1.4). (ii) $c \le E[v_i^2] \le C$ for all $1 \le i \le n$, and $\bar{E}[d_i^4] + \bar{E}[v_i^4] + \max_{1 \le j \le p}(\bar{E}[x_{ij}^2 d_i^2] + \bar{E}[|x_{ij} v_i|^3]) \le C$. (iii) There exists $s = s_n \ge 1$ such that $\|\beta_0\|_0 \le s$ and $\|\theta_0\|_0 \le s$. (iv) The error distribution $F$ is absolutely continuous with continuously differentiable density $f_\epsilon(\cdot)$ such that $f_\epsilon(0) \ge c > 0$ and $f_\epsilon(t) \vee |f_\epsilon'(t)| \le C$ for all $t \in \mathbb{R}$, and (v) $(K_x^4 + K_x^2 s^2 + s^3)\log^3(p \vee n) \le n\delta_n$.

Comment 2.4. Condition I(i) imposes the setting discussed in the previous section with the zero conditional median of the error distribution. Condition I(ii) imposes moment conditions on the structural errors and regressors to ensure good model selection performance of Lasso applied to equation (1.4). The approximate sparsity I(iii) imposes sparsity of the high-dimensional vectors $\beta_0$ and $\theta_0$. In the theorems below we provide the required technical conditions on the growth of $s\log p$ since it is dependent on the choice of algorithm. Condition I(iv) is a set of standard assumptions in the LAD literature (see [M]) and in the instrumental quantile regression literature [10]. Condition I(v) restricts the sparsity index, so that $s^3\log^3(p \vee n) = o(n)$ is required; this is analogous to the standard assumption $s^3(\log n)^2 = o(n)$ (see [11 ) invoked in the LAD analysis without any selection (i.e., where $p = s$). Most importantly, no assumptions on the separation from zero of the non-zero coefficients of $\theta_0$ and $\beta_0$ are made.

The next condition concerns the behavior of the Gram matrix $E_n[\tilde x_i\tilde x_i']$ where $\tilde x_i = (d_i, x_i')'$. Whenever $p + 1 > n$, the empirical Gram matrix $E_n[\tilde x_i\tilde x_i']$ does not have full rank and in principle is not well-behaved. However, we only need good behavior of smaller submatrices. Define the minimal and maximal $m$-sparse eigenvalue of $E_n[\tilde x_i\tilde x_i']$ as

$$\phi_{\min}(m) := \min_{1 \le \|\delta\|_0 \le m} \frac{\delta' E_n[\tilde x_i\tilde x_i']\delta}{\|\delta\|^2} \quad \text{and} \quad \phi_{\max}(m) := \max_{1 \le \|\delta\|_0 \le m} \frac{\delta' E_n[\tilde x_i\tilde x_i']\delta}{\|\delta\|^2}. \tag{2.13}$$

To assume that $\phi_{\min}(m) > 0$ requires that all empirical Gram submatrices formed by any $m$ components of $\tilde x_i$ are positive definite. We shall employ the following condition as a sufficient condition for our results.

Condition SE. There exists a sequence of constants $\ell_n \to \infty$ such that the maximal and minimal $\ell_n s$-sparse eigenvalues are bounded from above and away from zero, namely with probability at least $1 - \Delta_n$,

$$\kappa' \le \phi_{\min}(\ell_n s) \le \phi_{\max}(\ell_n s) \le \kappa'',$$

where $0 < \kappa' < \kappa'' < \infty$ are constants independent of $n$.

Comment 2.5. Condition SE is quite plausible for many designs of interest. Essentially it can be established by combining tail conditions of the regressors and a growth restriction on $s$ and $p$ relative to $n$. For instance, Theorem 3.2 in [25] (see also [28] and [T]) shows that Condition SE holds for i.i.d. zero-mean sub-Gaussian regressors and $s\log(n \vee p) \le \delta_n n$; while Theorem 1.8 [2^ (see also Lemma 1 in [^) shows that Condition SE holds for i.i.d. uniformly bounded zero-mean regressors and $s(\log n)\log(p \vee n) \le \delta_n n$.

2.3. Results. We begin with considering Algorithm 1. 

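Theorem 1 (Robust Inference, Algorithm 1). Let $\check\alpha$ be obtained by Algorithm 1. Suppose that Conditions I and SE are satisfied for all $n \ge 1$. Moreover, suppose that with probability at least $1 - \Delta_n$, $\|\tilde\beta\|_0 \le Cs$. Then, as $n \to \infty$ and for $\sigma_n^2 = 1/(4 f_\epsilon^2 \bar{E}[v_i^2])$,

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1) \quad \text{and} \quad n L_n(\alpha_0) \rightsquigarrow \chi^2(1).$$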

Theorem 1 establishes the first main result of the paper. Theorem 1 relies on the post-model selection estimators, which in turn hinge on achieving sufficiently sparse estimates $\tilde\beta$ and $\tilde\theta$. Sparsity of the former can be directly achieved under sharp penalty choices for optimal rates as discussed in the Supplementary Appendix D.2. The sparsity for the latter potentially requires a heavier penalty as shown in [3]. Alternatively, sparsity for the estimator in Step 1 can also be achieved by truncating the smallest components of the estimate $\hat\beta$.²

Next we turn to the analysis of Algorithm 2, which relies on the regularized estimators instead of the post-model selection estimators. Theorem 2 below establishes that Algorithm 2 achieves the same inferential guarantees as the results in Theorem 1 for Algorithm 1.

Theorem 2 (Robust Inference, Algorithm 2). Let $\check\alpha$ be obtained by Algorithm 2. Suppose that Conditions I and SE are satisfied for all $n \ge 1$. Moreover, suppose that with probability at least $1 - \Delta_n$, $\|\hat\beta\|_0 \le Cs$. Then, as $n \to \infty$ and for $\sigma_n^2 = 1/(4 f_\epsilon^2 \bar{E}[v_i^2])$,

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1) \quad \text{and} \quad n L_n(\alpha_0) \rightsquigarrow \chi^2(1).$$

Theorem 2 establishes the second main result of the paper.

An important consequence of these results is the following corollary. Here $\mathcal{Q}_n$ denotes a collection of distributions for $\{(y_i, d_i)'\}_{i=1}^n$ and for $Q_n \in \mathcal{Q}_n$ the notation $P_{Q_n}$ means that under $P_{Q_n}$, $\{(y_i, d_i)'\}_{i=1}^n$ is distributed according to $Q_n$.

² Lemma 3 in Appendix C formally shows that a suitable truncation preserves the rate of convergence under our conditions.
Corollary 1 (Uniformly Valid Confidence Intervals). Let $\check\alpha$ be the estimator of $\alpha_0$ constructed according to Algorithm 1 (resp. Algorithm 2) and let $\mathcal{Q}_n$ be the collection of all distributions of $\{(y_i, d_i)'\}_{i=1}^n$ for which the conditions of Theorem 1 (resp. Theorem 2) are satisfied for given $n \ge 1$. Then as $n \to \infty$, uniformly in $Q_n \in \mathcal{Q}_n$,

$$P_{Q_n}\left(\alpha_0 \in [\check\alpha \pm \sigma_n z_{\xi/2}/\sqrt{n}]\right) \to 1 - \xi \quad \text{and} \quad P_{Q_n}(\alpha_0 \in \hat A_{n,\xi}) \to 1 - \xi,$$

where $z_{\xi/2} = \Phi^{-1}(1 - \xi/2)$ and $\hat A_{n,\xi} = \{\alpha \in \mathcal{A} : n L_n(\alpha) \le (1 - \xi)\text{-quantile of } \chi^2(1)\}$.
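
The first interval in Corollary 1 can be computed as in the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: $\bar{E}[v_i^2]$ is replaced by the sample average of the squared estimated residuals, the error density at zero `f_eps` is assumed to be supplied by the user (in practice it must be estimated), and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(alpha_check, v, f_eps, xi=0.05):
    """[alpha_check -/+ sigma_n * z_{xi/2} / sqrt(n)] with sigma_n^2 = 1/(4 f_eps^2 E_n[v_i^2])."""
    n = len(v)
    v_ms = sum(x * x for x in v) / n          # plug-in for E-bar[v_i^2]
    sigma_n = 1.0 / (2.0 * f_eps * math.sqrt(v_ms))
    z = NormalDist().inv_cdf(1.0 - xi / 2.0)  # z_{xi/2} = Phi^{-1}(1 - xi/2)
    half = sigma_n * z / math.sqrt(n)
    return alpha_check - half, alpha_check + half


lo, hi = wald_interval(0.5, [1.0] * 100, f_eps=0.5)
```

With these toy inputs $\sigma_n = 1$ and $n = 100$, so the half-width equals $z_{0.025}/10$.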

Corollary 1 establishes the third main result of the paper; it highlights the uniform nature of the results. As long as the overall sparsity requirements hold, imperfect model selection in Steps 1 and 2 does not compromise the results. The robustness of the approach is also apparent from the fact that Corollary 1 allows for the data-generating process to change with $n$. This result is new even under the traditional case of fixed-$p$ asymptotics. Conditions I and SE together with the appropriate side conditions in the theorems explicitly characterize regions of data-generating processes for which the uniformity result holds. Simulation results discussed next also provide additional evidence that these regions are substantial.

3. Monte-Carlo Experiments 

In this section we examine the finite sample performance of the proposed estimators. We focus on the estimator associated with Algorithm 1 based on post-model selection methods.

We considered the following regression model:

$$y = d\alpha_0 + x'(c_y\theta_0) + \epsilon, \quad d = x'(c_d\theta_0) + v, \tag{3.14}$$

where $\alpha_0 = 1/2$, $\theta_{0j} = 1/j^2$, $j = 1,\ldots,10$, and $\theta_{0j} = 0$ otherwise, $x = (1, z')'$ consists of an intercept and covariates $z \sim N(0, \Sigma)$, and the errors $\epsilon$ and $v$ are independently and identically distributed as $N(0, 1)$. The dimension $p$ of the covariates $x$ is 300, and the sample size $n$ is 250. The regressors are correlated with $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.5$. The coefficients $c_y$ and $c_d$ are used to control the $R^2$ of the reduced form equation. For each equation, we consider the following values for the $R^2$: $\{0, 0.1, 0.2, \ldots, 0.8, 0.9\}$. Therefore we have 100 different designs and results are based on 500 repetitions for each design. For each repetition we draw new vectors $x_i$'s and errors $\epsilon_i$'s and $v_i$'s.
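
One way to draw data from the design (3.14) is sketched below. This is a hedged illustration, not the simulation code used for the reported results: the function names are hypothetical, the Toeplitz covariance $\Sigma_{ij} = \rho^{|i-j|}$ is generated through a stationary AR(1) recursion, and the scaling constants `c_y` and `c_d` are taken as inputs because the mapping from the target $R^2$ to these constants is not spelled out in the text.

```python
import math
import random


def theta0(p):
    """theta_{0j} = 1/j^2 for j = 1..10 and zero otherwise (1-based j)."""
    return [1.0 / (j ** 2) if j <= 10 else 0.0 for j in range(1, p + 1)]


def draw_x(p, rho, rng):
    """x = (1, z')' with z ~ N(0, Sigma), Sigma_ij = rho^{|i-j|}, via an AR(1) recursion."""
    z = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho ** 2)
    for _ in range(p - 2):
        z.append(rho * z[-1] + scale * rng.gauss(0.0, 1.0))
    return [1.0] + z


def simulate(n, p, c_y, c_d, alpha0=0.5, rho=0.5, seed=0):
    """Draw n observations (y, d, x) from the design (3.14)."""
    rng = random.Random(seed)
    th = theta0(p)
    data = []
    for _ in range(n):
        x = draw_x(p, rho, rng)
        index = sum(xj * tj for xj, tj in zip(x, th))  # x'theta_0
        d = c_d * index + rng.gauss(0.0, 1.0)
        y = alpha0 * d + c_y * index + rng.gauss(0.0, 1.0)
        data.append((y, d, x))
    return data


sample = simulate(n=5, p=12, c_y=1.0, c_d=1.0)
```

The paper's setting corresponds to `n=250` and `p=300`; the small values in the example call are only to keep the illustration light.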

The design above with $x'(c_y\theta_0)$ is a sparse model. However, the decay of the components of $\theta_0$ rules out typical "separation from zero" assumptions of the coefficients of "important" covariates (since the last component is of the order of $1/n$), unless $c_y$ is very large. Thus, we anticipate that "standard" post-selection inference procedures, which rely on model selection of the outcome equation only, work poorly in the simulation study. In contrast, based upon the prior theoretical arguments, we anticipate that our instrumental LAD estimator, which works off both equations in (3.14), will work well in the simulation study.

The simulation study focuses on Algorithm 1. Standard errors are computed using the formula (1.10). (Algorithm 2 worked similarly, though somewhat worse due to larger biases.) As the main benchmark we consider the standard post-model selection estimator $\tilde\alpha$ based on the post $\ell_1$-penalized LAD method, as defined in (1.3).



In Figure 1 we display the (empirical) rejection probability of tests of a true hypothesis $\alpha = \alpha_0$, with nominal size of tests equal to 0.05. The left-top plot shows the rejection frequency of the standard post-model selection inference procedure based upon $\tilde\alpha$ (where the inference procedure assumes perfect recovery of the true model). The rejection frequency deviates very sharply from the ideal rejection frequency of 0.05. This confirms the anticipated failure (lack of uniform validity) of inference based upon the standard post-model selection procedure in designs where coefficients are not well separated from zero (so that perfect recovery does not happen). In sharp contrast, the right top and bottom plots show that both of our proposed procedures (based on the estimator $\check\alpha$ and the result (1.9), and on the statistic $L_n$ and the result (1.12)) perform well, closely tracking the ideal level of 0.05. This is achieved uniformly over all the designs considered in the study, and this confirms our theoretical results established in Corollary 1.

In Figure 2 we compare the performance of the standard post-selection estimator $\tilde\alpha$ (defined in (1.3)) and our proposed post-selection estimator $\check\alpha$ (obtained via Algorithm 1). We display results in three different metrics of performance: mean bias (top row), standard deviation (middle row), and root mean square error (bottom row) of the two approaches. The significant bias for the standard post-selection procedure occurs when the indirect equation (1.4) is nontrivial, that is, when the main regressor is correlated with other controls. Such bias can be positive or negative depending on the particular design. The proposed post-selection estimator $\check\alpha$ performs well in all three metrics. The root mean square errors for the proposed estimator $\check\alpha$ are typically much smaller than those for the standard post-model selection estimator $\tilde\alpha$ (as shown by the bottom plots in Figure 2). This is fully consistent with our theoretical results and minimax efficiency considerations given in Section 5.

4. Generalization to Heteroscedastic Case 

We emphasize that both proposed algorithms exploit the homoscedasticity of the model (1.1) with respect to the error term $\epsilon_i$. The generalization to the heteroscedastic case can be achieved as follows. In order to achieve the semiparametric efficiency bound we need to consider the weighted version of the auxiliary equation (1.4). Specifically, we can rely on the following weighted decomposition:

$$f_i d_i = f_i x_i'\theta_0^* + v_i^*, \quad E[f_i v_i^*] = 0, \quad i = 1,\ldots,n, \tag{4.15}$$

where the weights are conditional densities of error terms $\epsilon_i$ evaluated at their medians of 0,

$$f_i = f_{\epsilon_i}(0 \mid d_i, x_i), \quad i = 1,\ldots,n, \tag{4.16}$$

which in general vary under heteroscedasticity. With that in mind it is straightforward to adapt the proposed algorithms when the weights $(f_i)_{i=1}^n$ are known. For example, Algorithm 1 becomes as follows.

Algorithm 1' (Based on Post-Model Selection estimators). 

(1) Run Post-$\ell_1$-penalized LAD of $y_i$ on $d_i$ and $x_i$; keep fitted value $x_i'\tilde\beta$.

(2) Run Post-Lasso of $f_i d_i$ on $f_i x_i$; keep the residual $\hat v_i^* := f_i(d_i - x_i'\tilde\theta)$.

(3) Run Instrumental LAD regression of $y_i - x_i'\tilde\beta$ on $d_i$ using $\hat v_i^*$ as the instrument for $d_i$ to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference.

An analogous generalization of Algorithm 2 based on regularized estimator results from removing the 
word "Post" in the algorithm above. 

Under similar regularity conditions, uniformly over a large collection $\mathcal{Q}_n^*$ of distributions of $\{(y_i, d_i)'\}_{i=1}^n$, the estimator $\check\alpha$ above obeys

$$(4\bar{E}[v_i^{*2}])^{1/2}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1). \tag{4.17}$$

Moreover, the criterion function at the true value $\alpha_0$ in Step 3 also has a pivotal behavior, namely

$$n L_n(\alpha_0) \rightsquigarrow \chi^2(1), \tag{4.18}$$

which can also be used to construct a confidence region $\hat A_{n,\xi}$ based on the $L_n$-statistic as in (1.12) with coverage $1 - \xi$ uniformly over the collection of distributions $\mathcal{Q}_n^*$.

In practice the density function values $(f_i)_{i=1}^n$ are typically unknown and need to be replaced by estimates $(\hat f_i)_{i=1}^n$. The analysis of the impact of such estimation is very delicate and is developed in the companion work [8], which considers the more general problem of uniformly valid inference for quantile regression models in approximately sparse models.

5. Discussion and Conclusion 

5.1. Connection to Neymanization. In this section we make some connections to Neyman's $C(\alpha)$ test ([19], [20]). For the sake of exposition we assume that $(y_i, x_i, d_i)_{i=1}^n$ are i.i.d. but we shall use the heteroscedastic setup introduced in the previous section. We consider the estimating equation for $\alpha_0$:

$$E[\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)\, v_i] = 0.$$

Our problem is to find useful instruments $v_i$ such that

$$\frac{\partial}{\partial\beta} E[\varphi(y_i - d_i\alpha_0 - x_i'\beta)\, v_i]\Big|_{\beta=\beta_0} = 0.$$

If this property holds, the estimator of $\alpha_0$ will be "immunized" against "crude" or nonregular estimation of $\beta_0$, for example, via a post-selection procedure or some regularization procedure. Such immunization ideas are in fact behind Neyman's classical construction of his $C(\alpha)$ test, so we shall use the term "Neymanization" to describe such a procedure. There will be many instruments $v_i$ that can achieve the property stated above, and there will be one that is optimal.

The instruments can be constructed by taking $v_i := z_i/f_i$, where $z_i$ is the residual in the regression equation:

$$w_i d_i = w_i m_0(x_i) + z_i, \quad E[w_i z_i \mid x_i] = 0, \tag{5.19}$$

where $w_i$ is a nonnegative weight, a function of $(d_i, x_i)$ only, for example $w_i = 1$ or $w_i = f_i$; the latter choice will in fact be optimal. Note that the function $m_0(x_i)$ solves the least squares problem

$$\min_{h \in \mathcal{H}} E[\{w_i d_i - w_i h(x_i)\}^2], \tag{5.20}$$

where $\mathcal{H}$ is the class of measurable functions $h(x_i)$ such that $E[w_i^2 h^2(x_i)] < \infty$. Our assumption is that $m_0(x_i)$ is a sparse function $x_i'\theta_0$, with $\|\theta_0\|_0 \le s$, so that

$$w_i d_i = w_i x_i'\theta_0 + z_i, \quad E[w_i z_i \mid x_i] = 0. \tag{5.21}$$

In finite samples, the sparsity assumption allows us to employ post-Lasso and Lasso to solve the least squares problem above approximately, and to estimate $z_i$. Of course, the use of other structured assumptions may motivate the use of other regularization methods.

Arguments similar to those in the proofs show that, for $\sqrt{n}(\alpha - \alpha_0) = O(1)$,

$$\sqrt{n}\{E_n[\varphi(y_i - d_i\alpha - x_i'\hat\beta)\, v_i] - E_n[\varphi(y_i - d_i\alpha - x_i'\beta_0)\, v_i]\} = o_P(1),$$

for $\hat\beta$ based on a sparse estimation procedure, despite the fact that $\hat\beta$ converges to $\beta_0$ at a slower rate than $1/\sqrt{n}$. That is, the empirical estimating equations behave as if $\beta_0$ is known. Hence for estimation we can use $\check\alpha$ as a minimizer of the statistic:

$$L_n(\alpha) = c_n^{-1}\,|\sqrt{n}\,E_n[\varphi(y_i - d_i\alpha - x_i'\hat\beta)\, v_i]|^2,$$

where $c_n = E_n[v_i^2]/4$. Since $L_n(\alpha_0) \rightsquigarrow \chi^2(1)$, we can also use the statistic directly for testing hypotheses and for construction of confidence sets.

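This is in fact a version of Neyman's $C(\alpha)$ test statistic, adapted to the present non-smooth setting. The usual expression of the $C(\alpha)$ statistic is different. To see a more familiar form, note that $\theta_0 = E[w_i^2 x_i x_i']^{-} E[w_i^2 x_i d_i]$, where $A^{-}$ denotes a generalized inverse of $A$, and write

$$v_i = (w_i/f_i)\, d_i - (w_i/f_i)\, x_i' E[w_i^2 x_i x_i']^{-} E[w_i^2 x_i d_i], \quad \text{and} \quad \hat\varphi_i := \varphi(y_i - d_i\alpha - x_i'\hat\beta),$$

so that

$$L_n(\alpha) = c_n^{-1}\,\left|\sqrt{n}\,\{E_n[\hat\varphi_i (w_i/f_i) d_i] - E_n[\hat\varphi_i (w_i/f_i) x_i]' E[w_i^2 x_i x_i']^{-} E[w_i^2 x_i d_i]\}\right|^2.$$

This is indeed a familiar form of a $C(\alpha)$ statistic.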

The estimator a that minimizes L„ up to op(l), under suitable regularity conditions, 

CT-^V^{a - ao) - iV(0, 1), al = -^¥.[fd^v^]-Mv1]. 

The smallest value of $\sigma_n^2$ is achieved by using $v_i = v_i^*$ induced by setting $w_i = f_i^2$:

$\sigma_n^{*2} = \tfrac{1}{4}E[v_i^{*2}]^{-1}.$ (5.22)

Thus, setting Wi — fi gives an optimal instrument v* amongst all "immunizing" instruments generated 
by the process described above. Obviously, this improvement translates into shorter confidence intervals 
and better testing based on either a or L„. While Wi — fi is optimal, fi will have to be estimated in 
practice, resulting actually in more stringent condition than when using non-optimal, known weights, e.g., 
Wi = I. The use of known weights may also give better behavior under misspecification of the model. 
Under homoscedasticity, $w_i = 1$ is an optimal weight.

5.2. Minimax Efficiency. There is also a clean connection to the (local) minimax efficiency analysis
from the semiparametric efficiency analysis. [16] derives an efficient score function for the partially linear
median regression model:

$S_i = 2\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)f_i[d_i - m_0^*(x_i)],$
where $m_0^*(x_i)$ is $m_0(x_i)$ in (5.19) induced by the weight $w_i = f_i^2$:

$m_0^*(x_i) = \dfrac{E[f_i^2 d_i \mid x_i]}{E[f_i^2 \mid x_i]}.$
Using the assumption $m_0^*(x_i) = x_i'\theta_0^*$, where $\|\theta_0^*\|_0 \le s \ll n$ is sparse, we have that

$S_i = 2\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)v_i^*,$

which is the score that was constructed using Neymanization. It follows that the estimator based on the
instrument $v_i^*$ is actually efficient in the minimax sense (see Theorem 18.4 in [15]), and inference about
$\alpha_0$ based on this estimator provides best minimax power against local alternatives (see Theorem 18.12
in [15]).

The claim above is formal as long as, given a law $Q_n^*$, the least favorable submodels are permitted
as deviations that lie within the overall model $\mathcal{Q}_n$. Specifically, given a law $Q_n^*$, we shall need to allow
for a certain neighborhood $\mathcal{Q}_n^\delta$ of $Q_n^*$ such that $Q_n^* \in \mathcal{Q}_n^\delta \subset \mathcal{Q}_n$, where the overall model $\mathcal{Q}_n$ is defined
similarly as before, except now permitting heteroscedasticity (or we can keep homoscedasticity $f_i = f_\epsilon$ to
maintain formality). To allow for this we consider a collection of laws indexed by a parameter $t = (t_1, t_2)$,
generated by:

$y_i = d_i(\alpha_0^* + t_1) + x_i'(\beta_0^* + t_2\theta_0^*) + \epsilon_i, \qquad \|t\| \le \delta,$ (5.23)

$f_i d_i = f_i x_i'\theta_0^* + v_i^*, \qquad E[f_i v_i^* \mid x_i] = 0,$ (5.24)

where $\|\beta_0^*\|_0 + \|\theta_0^*\|_0 \le s$ and conditions as in Section 2 hold. The case with $t = 0$ generates the law $Q_n^*$; by
varying $t$ within the $\delta$-ball, we generate the set of laws, denoted $\mathcal{Q}_n^\delta$, containing the least favorable deviations
from $t = 0$. By [16], the efficient score for the model given above is $S_i$, so we cannot have a better regular
estimator than the estimator whose influence function is $J^{-1}S_i$, where $J = E[S_i^2]$. Since our overall model
$\mathcal{Q}_n$ contains $\mathcal{Q}_n^\delta$, all the formal conclusions about (local minimax) optimality of our estimators hold from
theorems cited above (using subsequence arguments to handle models changing with $n$). Our estimators
are regular, since under any law $Q_n$ in the set $\mathcal{Q}_n^\delta$ with $\delta \to 0$, the first order asymptotics of $\sqrt{n}(\check\alpha - \alpha_0)$
does not change, as a consequence of theorems in Section 2 (in fact our theorems show more than this).

5.3. Conclusion. In this paper we propose a method for inference on the coefficient $\alpha_0$ of a main regressor
that holds uniformly over many data-generating processes and is robust to possible "moderate" model
selection mistakes. The robustness of the method is achieved by relying on a Neyman-type estimating
equation whose gradient with respect to the nuisance parameters is zero. In the present homoscedastic
setting the proposed estimator is asymptotically normal and also achieves the semi-parametric efficiency
bound.
Appendix A. Instrumental LAD Regression with Estimated Inputs

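Throughout this section, let

$\psi_{\alpha,\beta,\theta}(y_i, d_i, x_i) = (1/2 - 1\{y_i \le x_i'\beta + d_i\alpha\})(d_i - x_i'\theta)$

$= (1/2 - 1\{y_i \le x_i'\beta + d_i\alpha\})\{v_i - x_i'(\theta - \theta_0)\}.$

For fixed $\alpha \in \mathbb{R}$ and $\beta, \theta \in \mathbb{R}^p$, define the function

$\Gamma(\alpha,\beta,\theta) := E[\psi_{\alpha,\beta,\theta}(y_i, d_i, x_i)].$

For notational convenience, let $h = (\beta', \theta')'$, $h_0 = (\beta_0', \theta_0')'$ and $\hat h = (\hat\beta', \hat\theta')'$. The partial derivative of
$\Gamma(\alpha,\beta,\theta)$ with respect to $\alpha$ is denoted by $\Gamma_1(\alpha,\beta,\theta)$ and the partial derivative of $\Gamma(\alpha,\beta,\theta)$ with respect
to $h = (\beta', \theta')'$ is denoted by $\Gamma_2(\alpha,\beta,\theta)$. Consider the following high-level condition. Here $(\hat\beta', \hat\theta')'$ is a
generic estimator of $(\beta_0', \theta_0')'$ (and not necessarily $\ell_1$-LAD and Lasso estimators, resp.), and $\check\alpha$ is defined
by $\check\alpha \in \arg\min_{\alpha \in \hat{\mathcal{A}}} L_n(\alpha)$ with this $(\hat\beta', \hat\theta')'$, where $\hat{\mathcal{A}}$ here is also a generic (possibly random) compact
interval. We assume that $(\hat\beta', \hat\theta')'$, $\hat{\mathcal{A}}$ and $\check\alpha$ satisfy the following conditions.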

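Condition ILAD. (i) $f_\epsilon(t) \vee |f_\epsilon'(t)| \le C$ for all $t \in \mathbb{R}$, $E[v_i^2] \ge c > 0$ and $E[v_i^4] \vee E[d_i^4] \le C$.
Moreover, for some sequences $\delta_n \searrow 0$ and $\Delta_n \searrow 0$, with probability at least $1 - \Delta_n$,
(ii) $\{\alpha : |\alpha - \alpha_0| \le n^{-1/2}/\delta_n\} \subset \hat{\mathcal{A}}$, where $\hat{\mathcal{A}}$ is a (possibly random) compact interval;
(iii) the estimated parameters $(\hat\beta', \hat\theta')'$ satisfy

$\{1 \vee \max_{1 \le i \le n}(E[|v_i|] \vee |x_i'(\hat\theta - \theta_0)|)\}^{1/2}\|x_i'(\hat\beta - \beta_0)\|_{2,n} \le \delta_n n^{-1/4}, \quad \|x_i'(\hat\theta - \theta_0)\|_{2,n} \le \delta_n n^{-1/4},$ (A.25)

$\sup_{\alpha \in \hat{\mathcal{A}}} |\mathbb{G}_n(\psi_{\alpha,\hat\beta,\hat\theta} - \psi_{\alpha,\beta_0,\theta_0})| \le \delta_n,$ (A.26)

where recall that $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(y_i, d_i, x_i) - E[f(y_i, d_i, x_i)]\}$; and lastly
(iv) the estimator $\check\alpha$ satisfies $|\check\alpha - \alpha_0| \le \delta_n$.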

Comment A.1. Condition ILAD suffices to make the impact of the estimation of instruments negligible
on the first order asymptotics of the estimator $\check\alpha$. We note that Condition ILAD covers several different
estimators including both estimators proposed in Algorithms 1 and 2.

The following lemma summarizes the main inferential result based on the high level Condition ILAD.
Lemma 1. Under Condition ILAD we have, for $\sigma_n^2 = 1/(4f_\epsilon^2E[v_i^2])$,

$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$ and $nL_n(\alpha_0) \rightsquigarrow \chi^2(1)$.
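
The first assertion of the lemma translates directly into a Wald-type interval. The following is a minimal sketch under stated assumptions: `f_eps` is a user-supplied estimate of the error density at zero, `v_hat` holds estimated instruments, and `z = 1.96` approximates the 0.975 standard normal quantile. The function name and arguments are hypothetical and are not part of the paper.

```python
import math


def wald_interval(alpha_check, f_eps, v_hat, z=1.96):
    """Interval alpha_check +/- z * sigma_n / sqrt(n), with
    sigma_n^2 = 1 / (4 * f_eps^2 * E_n[v_hat^2]) as a plug-in for the lemma's variance."""
    n = len(v_hat)
    ev2 = sum(v * v for v in v_hat) / n          # empirical second moment of the instrument
    sigma_n = 1.0 / (2.0 * f_eps * math.sqrt(ev2))
    half_width = z * sigma_n / math.sqrt(n)
    return alpha_check - half_width, alpha_check + half_width
```

For example, with `v_hat = [1, -1, 1, -1]` and `f_eps = 0.5` the plug-in value of $\sigma_n$ equals 1, so the half-width is `1.96 / 2`.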

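Proof of Lemma 1. We shall separate the proof into two parts.
Part 1. (Proof for the first assertion). Observe that

$E_n[\psi_{\check\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + E_n[\psi_{\check\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i) - \psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)]$
$= E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + \Gamma(\check\alpha,\hat\beta,\hat\theta) + n^{-1/2}\mathbb{G}_n(\psi_{\check\alpha,\hat\beta,\hat\theta} - \psi_{\check\alpha,\beta_0,\theta_0}) + n^{-1/2}\mathbb{G}_n(\psi_{\check\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})$

$= I + II + III + IV.$
By Condition ILAD(iii) (A.26) we have with probability at least $1 - \Delta_n$ that $|III| \le \delta_n n^{-1/2}$. We
wish to show that

$|II + (f_\epsilon E[v_i^2])(\check\alpha - \alpha_0)| \lesssim_P \delta_n n^{-1/2} + \delta_n|\check\alpha - \alpha_0|.$ (A.27)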

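Observe that

$\Gamma(\alpha,\beta,\theta) = \Gamma(\alpha,\beta_0,\theta_0) + \Gamma(\alpha,\beta,\theta) - \Gamma(\alpha,\beta_0,\theta_0)$

$= \Gamma(\alpha,\beta_0,\theta_0) + \{\Gamma(\alpha,\beta,\theta) - \Gamma(\alpha,\beta_0,\theta_0) - \Gamma_2(\alpha,\beta_0,\theta_0)'(h - h_0)\} + \Gamma_2(\alpha,\beta_0,\theta_0)'(h - h_0).$

Since $\Gamma(\alpha_0,\beta_0,\theta_0) = 0$, by Taylor's theorem, there exists some point $\tilde\alpha$ between $\alpha_0$ and $\check\alpha$ such that
$\Gamma(\check\alpha,\beta_0,\theta_0) = \Gamma_1(\tilde\alpha,\beta_0,\theta_0)(\check\alpha - \alpha_0)$. By its definition, we have

$\Gamma_1(\alpha,\beta_0,\theta_0) = -E[f_\epsilon(d_i(\alpha - \alpha_0))d_i v_i].$

Since $f_\epsilon = f_\epsilon(0)$ and $d_i = x_i'\theta_0 + v_i$ with $E[v_i] = 0$, we have $\Gamma_1(\alpha_0,\beta_0,\theta_0) = -f_\epsilon E[d_i v_i] = -f_\epsilon E[v_i^2]$. Also
$|\Gamma_1(\alpha,\beta_0,\theta_0) - \Gamma_1(\alpha_0,\beta_0,\theta_0)| \le |E[\{f_\epsilon(0) - f_\epsilon(d_i(\alpha - \alpha_0))\}d_i v_i]| \le C|\alpha - \alpha_0|E[|d_i^2 v_i|].$

Hence $\Gamma(\check\alpha,\beta_0,\theta_0) = -f_\epsilon E[v_i^2](\check\alpha - \alpha_0) + O(1)|\check\alpha - \alpha_0|^2$.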
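Observe that

$\Gamma_2(\alpha,\beta,\theta) = \begin{pmatrix} -E[f_\epsilon(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))(d_i - x_i'\theta)x_i] \\ -E[(1/2 - 1\{y_i \le x_i'\beta + d_i\alpha\})x_i] \end{pmatrix}.$

Note that since $E[f_\epsilon(0)(d_i - x_i'\theta_0)x_i] = f_\epsilon E[v_i x_i] = 0$ and $E[(1/2 - 1\{y_i \le x_i'\beta_0 + d_i\alpha_0\})x_i] = E[(1/2 - 1\{\epsilon_i \le 0\})x_i] = 0$,
we have $\Gamma_2(\alpha_0,\beta_0,\theta_0) = 0$. Moreover,

$|\Gamma_2(\alpha,\beta_0,\theta_0)'(\hat h - h_0)| = |\{\Gamma_2(\alpha,\beta_0,\theta_0) - \Gamma_2(\alpha_0,\beta_0,\theta_0)\}'(\hat h - h_0)|$

$\le |E[\{f_\epsilon(d_i(\alpha - \alpha_0)) - f_\epsilon(0)\}v_i x_i'](\hat\beta - \beta_0)|$
$+ |E[\{F_\epsilon(d_i(\alpha - \alpha_0)) - F_\epsilon(0)\}x_i'](\hat\theta - \theta_0)|$

$\le O(1)\{\|x_i'(\hat\beta - \beta_0)\|_{2,n} + \|x_i'(\hat\theta - \theta_0)\|_{2,n}\}|\alpha - \alpha_0|$
$= O_P(\delta_n)|\alpha - \alpha_0|.$

Hence $|\Gamma_2(\check\alpha,\beta_0,\theta_0)'(\hat h - h_0)| \lesssim_P \delta_n|\check\alpha - \alpha_0|$.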

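Denote by $\Gamma_{22}(\alpha,\beta,\theta)$ the Hessian matrix of $\Gamma(\alpha,\beta,\theta)$ with respect to $h = (\beta', \theta')'$. Writing $u_i = x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0)$, we have

$\Gamma_{22}(\alpha,\beta,\theta) = \begin{pmatrix} -E[f_\epsilon'(u_i)(d_i - x_i'\theta)x_i x_i'] & E[f_\epsilon(u_i)x_i x_i'] \\ E[f_\epsilon(u_i)x_i x_i'] & 0 \end{pmatrix},$

so that

$(h - h_0)'\Gamma_{22}(\alpha,\beta,\theta)(h - h_0) \le |(\beta - \beta_0)'E[f_\epsilon'(u_i)(d_i - x_i'\theta)x_i x_i'](\beta - \beta_0)|$

$+ 2|(\beta - \beta_0)'E[f_\epsilon(u_i)x_i x_i'](\theta - \theta_0)|$
$\le C\{\max_{1 \le i \le n} E[|d_i - x_i'\theta|]\,\|x_i'(\beta - \beta_0)\|_{2,n}^2 + 2\|x_i'(\beta - \beta_0)\|_{2,n} \cdot \|x_i'(\theta - \theta_0)\|_{2,n}\}.$

Here $|d_i - x_i'\theta| = |v_i - x_i'(\theta - \theta_0)| \le |v_i| + |x_i'(\theta - \theta_0)|$. Hence by Taylor's theorem together with ILAD(iii),
we conclude that

$|\Gamma(\check\alpha,\hat\beta,\hat\theta) - \Gamma(\check\alpha,\beta_0,\theta_0) - \Gamma_2(\check\alpha,\beta_0,\theta_0)'(\hat h - h_0)| \lesssim_P \delta_n^2 n^{-1/2}.$

This leads to the expansion in (A.27).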

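We now proceed to bound the fourth term. By Condition ILAD(iv) we have with probability at least
$1 - \Delta_n$ that $|\check\alpha - \alpha_0| \le \delta_n$. Observe that

$(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y_i,d_i,x_i) = (1\{y_i \le x_i'\beta_0 + d_i\alpha_0\} - 1\{y_i \le x_i'\beta_0 + d_i\alpha\})v_i$
$= (1\{\epsilon_i \le 0\} - 1\{\epsilon_i \le d_i(\alpha - \alpha_0)\})v_i,$

so that $|(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y_i,d_i,x_i)| \le 1\{|\epsilon_i| \le \delta_n|d_i|\}|v_i|$ whenever $|\alpha - \alpha_0| \le \delta_n$. Since the class of
functions $\{(y,d,x) \mapsto (\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y,d,x) : |\alpha - \alpha_0| \le \delta_n\}$ is a VC subgraph class with VC index
bounded by some constant independent of $n$, using (a version of) Theorem 2.14.1 in [25], we have

$\sup_{|\alpha - \alpha_0| \le \delta_n} |\mathbb{G}_n(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})| \lesssim_P (E[1\{|\epsilon_i| \le \delta_n|d_i|\}v_i^2])^{1/2} \lesssim_P \delta_n^{1/2}.$

This implies that $|IV| \lesssim_P \delta_n^{1/2}n^{-1/2}$.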

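Combining these bounds on II, III and IV, we have the following stochastic expansion

$E_n[\psi_{\check\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = -(f_\epsilon E[v_i^2])(\check\alpha - \alpha_0) + E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + O_P(\delta_n^{1/2}n^{-1/2}) + O_P(\delta_n)|\check\alpha - \alpha_0|.$

Let $\alpha^* = \alpha_0 + (f_\epsilon E[v_i^2])^{-1}E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)]$. Then $\alpha^* \in \hat{\mathcal{A}}$ with probability $1 - o(1)$ since
$|\alpha^* - \alpha_0| \lesssim_P n^{-1/2}$. It is not difficult to see that the above stochastic expansion holds with $\check\alpha$ replaced
by $\alpha^*$, so that

$E_n[\psi_{\alpha^*,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = -(f_\epsilon E[v_i^2])(\alpha^* - \alpha_0) + E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + O_P(\delta_n^{1/2}n^{-1/2}) = O_P(\delta_n^{1/2}n^{-1/2}).$

Therefore, $|E_n[\psi_{\check\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)]| \le |E_n[\psi_{\alpha^*,\hat\beta,\hat\theta}(y_i,d_i,x_i)]| = O_P(\delta_n^{1/2}n^{-1/2})$, so that
$(f_\epsilon E[v_i^2])(\check\alpha - \alpha_0) = E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + O_P(\delta_n^{1/2}n^{-1/2}),$
which immediately implies that $\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$ since by the Lyapunov CLT,

$(E[v_i^2]/4)^{-1/2}\sqrt{n}E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] \rightsquigarrow N(0,1).$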

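Part 2. (Proof for the second assertion). First consider the denominator of $L_n(\alpha_0)$. We have that

$|E_n[\hat v_i^2] - E_n[v_i^2]| = |E_n[(\hat v_i - v_i)(\hat v_i + v_i)]| \le \|\hat v_i - v_i\|_{2,n}\|\hat v_i + v_i\|_{2,n}$

$\le \|x_i'(\hat\theta - \theta_0)\|_{2,n}(2\|v_i\|_{2,n} + \|x_i'(\hat\theta - \theta_0)\|_{2,n}) \lesssim_P \delta_n,$

where we have used the fact that $\|v_i\|_{2,n} \lesssim_P (E[v_i^2])^{1/2} = O(1)$ (which is guaranteed by ILAD(i)).
Next consider the numerator of $L_n(\alpha_0)$. Since $E[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] = 0$ we have

$E_n[\psi_{\alpha_0,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = n^{-1/2}\mathbb{G}_n(\psi_{\alpha_0,\hat\beta,\hat\theta} - \psi_{\alpha_0,\beta_0,\theta_0}) + \Gamma(\alpha_0,\hat\beta,\hat\theta) + E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)].$
By Condition ILAD(iii) and the previous calculation, we have

$|\mathbb{G}_n(\psi_{\alpha_0,\hat\beta,\hat\theta} - \psi_{\alpha_0,\beta_0,\theta_0})| \lesssim_P \delta_n$ and $|\Gamma(\alpha_0,\hat\beta,\hat\theta)| \lesssim_P \delta_n n^{-1/2}.$
Therefore, using the simple identity that $nA_n^2 = nB_n^2 + n(A_n - B_n)^2 + 2nB_n(A_n - B_n)$ with

$A_n = E_n[\psi_{\alpha_0,\hat\beta,\hat\theta}(y_i,d_i,x_i)]$ and $B_n = E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] \lesssim_P (E[v_i^2])^{1/2}n^{-1/2},$
we have

$nL_n(\alpha_0) = \dfrac{4n|E_n[\psi_{\alpha_0,\hat\beta,\hat\theta}(y_i,d_i,x_i)]|^2}{E_n[\hat v_i^2]} = \dfrac{4n|E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)]|^2}{E[v_i^2]} + O_P(\delta_n),$

since $E[v_i^2] \ge c$ is bounded away from zero. The result then follows since

$(E[v_i^2]/4)^{-1/2}\sqrt{n}E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] \rightsquigarrow N(0,1).$ $\square$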

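Comment A.2 (On 1-step procedure). An inspection of the proof leads to the following stochastic
expansion:

$E_n[\psi_{\hat\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = -(f_\epsilon E[v_i^2])(\hat\alpha - \alpha_0) + E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)]$

$+ O_P(\delta_n^{1/2}n^{-1/2} + \delta_n n^{-1/4}|\hat\alpha - \alpha_0| + |\hat\alpha - \alpha_0|^2),$

where $\hat\alpha$ is any consistent estimator of $\alpha_0$. Hence provided that $|\hat\alpha - \alpha_0| = o_P(n^{-1/4})$, the remainder
term in the above expansion is $o_P(n^{-1/2})$, and the 1-step estimator $\check\alpha$ defined by

$\check\alpha = \hat\alpha + (E_n[f_\epsilon\hat v_i^2])^{-1}E_n[\psi_{\hat\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)]$

has the following stochastic expansion:

$\check\alpha = \hat\alpha + \{f_\epsilon E[v_i^2] + o_P(n^{-1/4})\}^{-1}\{-(f_\epsilon E[v_i^2])(\hat\alpha - \alpha_0) + E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + o_P(n^{-1/2})\}$

$= \alpha_0 + (f_\epsilon E[v_i^2])^{-1}E_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + o_P(n^{-1/2}),$

so that $\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$.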
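
The 1-step update in this comment is a single Newton-type correction of an initial estimate. The sketch below is an illustration under assumptions: the score is $(1/2 - 1\{y_i \le x_i'\hat\beta + d_i\hat\alpha\})\hat v_i$, `xb` holds the fitted values $x_i'\hat\beta$, `v_hat` holds the estimated instruments, and `f_eps` is a user-supplied estimate of the error density at zero. The function and argument names are hypothetical.

```python
def one_step(alpha_init, y, d, xb, v_hat, f_eps):
    """Return alpha_init + (E_n[f_eps * v_hat^2])^{-1} * E_n[psi], where
    psi_i = (1/2 - 1{y_i <= xb_i + d_i * alpha_init}) * v_hat_i."""
    n = len(y)
    score = sum(
        (0.5 - (1.0 if y[i] <= xb[i] + d[i] * alpha_init else 0.0)) * v_hat[i]
        for i in range(n)
    ) / n
    jacobian = f_eps * sum(v * v for v in v_hat) / n
    return alpha_init + score / jacobian
```

The update requires no optimization over $\alpha$, which is its main practical appeal; its validity rests on the rate condition on the initial estimator stated above.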

Appendix B. Proof of Theorem 1

The proof of Theorem 1 uses the properties of Post-$\ell_1$-LAD and Post-Lasso. We will collect these
properties together with required regularity conditions in Appendix D.

Proof of Theorem 1. We will verify Condition ILAD and the desired result then follows from Lemma 1.
The assumptions on the error density $f_\epsilon(\cdot)$ in Condition ILAD(i) are assumed in Condition I(iv). The
moment conditions on $d_i$ and $v_i$ in Condition ILAD(i) are assumed in Condition I(ii).

Condition SE implies that Kq is bounded away from zero with probability 1 — A„ for n sufficiently large, 
see [5]. Step 1 relies on Post-£i-LAD. By assumption with probability 1 — A„ we have s' — ||/3||o ^ Cs. 
Thus, by Condition SE (/'min(?+ s) is bounded away from zero since ?+ s ^ £„s for large enough 
n with probability 1 — A„. Moreover, Condition PL AD in Appendix [D] is implied by Condition L 
The required side condition of Lemma |4] is satisfied by relations (IF.4ip and (|F.42[) . By Lemma |4] we 
have \a — ao| <p •\/slog(p V n)/n ^ o(l)log~ n under s'^ log {p\J n) ^ (5„n. Note that this implies 
$\{\alpha : |\alpha - \alpha_0| \le n^{-1/2}\log n\} \subset \hat{\mathcal{A}}$ (with probability $1 - o(1)$) which is required in ILAD(ii) and the
(shrinking) definition of $\hat{\mathcal{A}}$ establishes the initial rate of ILAD(iv). By Lemma 5 in Appendix D we have
\\x^{/3 — /3o)\\2,n ^p \/slog{n y p)/n since the required side condition holds. Indeed, for Xi — {di, x[)' and 
5 = {5d,5'J, because of Condition SE and the fact that E„[|d,|3] <p E[|d,|3] = 0(1), 

> inf {0min(s+Cs)}-V''||^f 

> {./>„in(s+Cs)}^/^ > 1 

-^ 4_R:;^Vs+Cs0„ax(s+Cs)+4E„[|di|3] ^-^^ K^^' 



Therefore, since K^s^ log^(p V n) ^ (5„n and A < yjn \og{p V n) we have 



Step 2 relies on Post-Lasso. Condition HL in Appendix [D] is implied by Condition I and Lemma [5] 
applied twice with Q = Vi and Q — di under the condition that K'^ logp ^ 5nn. By Lemma[7]in Appendix 
|D]we have \\x'^{9 — 6*0)112, n <p ^slog(n V p)/n and ||6'||o < s with probability 1 — o(l). 

The rates established above for $\hat\theta$ and $\hat\beta$ imply (A.25) in ILAD(iii) since by Condition I(ii) $E[|v_i|] \le (E[v_i^2])^{1/2} = O(1)$
and $\max_{1 \le i \le n}|x_i'(\hat\theta - \theta_0)| \lesssim_P K_x\sqrt{s^2\log(p \vee n)/n} = o(1)$.

We now verify the last requirement in Condition ILAD(iii). Consider the following class of functions 
Ts = {{y, d, x) ^ l{y s; x'f3 + da}:aeR, ||/3||o < Cs}, 
which is the union of (i'^) VC-subgraph classes of functions with VC indices bounded by Cs. Hence 

l0giV(£, J-,, II • ||p„,2) < Sl0gp+ sl0g(l/£). 

Likewise, consider the following class of functions Qs.r — {{y,d,x) i~> x'9 : ||0||o ^ Cs, ||a;^0||2.ri =^ r}. 
Then 

l0giV(£||G's,r||p„,2,^s, II • ||p„.2) < S logp + S log(l/e), 

where Gs,r{y,d,x) = max||e||Q^c's,||x;9||2.„s;r- \x'0\. 
Note that 

sup |G„(V'„jj- V'a,/io,eo)l < sup |G„(V'„,3^,e- V'„,^,e„)l (B.28) 

+ sup |G„(V'„_o^eo ^ V'a,/3o,eo)l- (B-29) 

Consider to bound (|B.28p . Observe that 

ipa,i3,e{yi,di,Xi) -ipa,i3,eo{yi,d.i,Xi) = -(1/2- l{yi ^ Xif3 + dia})xi{9 - On), 

and consider the class of functions Hi ^ — {{y,d,x) t-^ (1/2— l{y ^ x' (3 + da})x' {6 — Oq) : a e M, ||/3||o ^ 
Cs, \\0\\o ^ Cs, \\x'^{9 — 0o)l|2.n =^ ^} with r < ■\/slog(p V n)/n. Then by Lemma [9] together with the 
above entropy calculations (and some straightforward algebras), we have 



sup |G„(g)| <p y^s log(p V n)y/s log{p V n)/n = op(l), 
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where s^log^(p V n) < (5„n is used. Since \\xi{6 — 0o)\\2,n <p ^slog(n Vp)/n and ||/3||o V ||6'||o < s with 
probabihty 1 — o(l), we conclude that (IB.28P = op(l). 

Lastly consider to bound (|B.29I) . Observe that 

where Vi = di ~ x'^Oq, and consider the class of functions "H^ ^ = {(?/, d, x) i— >■ (Ijy ^ a;'/3 + da} — Ijj/ ^ 
a;'/3o + da}){d - x'Oq) : a e R, ||^||o sJ Cs, \\x[{(3 - /3o)||2,n s^ r} with r < y^slogipV n)/n. Then by 
Lemma [9] together with the above entropy calculations (and some straightforward algebras), we have 



sup |G„(5)| <p ^/slog{pVn) sup jEn[g{yi,di,Xi)'^]\/ E[g{yi,di,Xiy]. 

Here we have 

E[g{y„d,,x,f] ^ C\\xr{/3 - MhAnvf])'^' < Vs\og{pV n)/n. 
On the other hand, 

sup E„[g(yj,di,Xi)^] s^ n"^/^ sup <Gn{g'^) + sup E[g(j/i,di,Xj)^], (B.30) 

and apply Lemma [5] to the first term on the right side of (|B.30p . Then we have 

sup Gn{g'^) <p \/s\og(j)V~n) sup \/En[g{yi,di,Xiy]\/ E[g{yi,di,x. 



J,"l,J'iy J 



geni , 3e«2 



< ^/slog{pVn)^En[vf]V^vf] <p ^/slog{pVn)^/^^f 



Since ||x^(/3 — /?o)||2,n ^p ^slog(rt V p)/n and ||/?||o ^ Cs with probability 1 — A„, we conclude that 

(|R29| <p y^slog{p\/n){s log(p V 7i)/ri)i/* = o(l), 
where s^ log^(p V n) < (5„n is used. D 

Appendix C. Auxiliary Technical Results 

In this section we collect two auxiliary technical results. Their proofs are given in the supplementary 
appendix. 

Lemma 2. Let $x_1, \ldots, x_n$ be non-stochastic vectors in $\mathbb{R}^p$ with $\max_{1 \le i \le n}\|x_i\|_\infty \le K_x$. Let $\zeta_1, \ldots, \zeta_n$
be independent random variables such that $E[|\zeta_i|^q] < \infty$ for some $q \ge 4$. Then with probability at least
$1 - 8\tau$,

max |(E„ - E)[4cf]| < 4/°iMlIif,^(E[|C.n/r)^/^. 

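Lemma 3. Let $T = \mathrm{support}(\beta_0)$, $|T| = \|\beta_0\|_0 \le s$ and $\|\hat\beta_{T^c}\|_1 \le c\|\hat\beta_T - \beta_0\|_1$. Moreover, let $\hat\beta^{(2m)}$
denote the vector formed by the largest $2m$ components of $\hat\beta$ in absolute value and zero in the remaining
components. Then for $m \ge s$ we have that $\hat\beta^{(2m)}$ satisfies

$\|x_i'(\hat\beta^{(2m)} - \beta_0)\|_{2,n} \le \|x_i'(\hat\beta - \beta_0)\|_{2,n} + \sqrt{\phi_{\max}(m)/m}\; c\|\hat\beta_T - \beta_0\|_1,$
where $\phi_{\max}(m)/m \le 2\phi_{\max}(s)/s$ and $\|\hat\beta_T - \beta_0\|_1 \le \sqrt{s}\|x_i'(\hat\beta - \beta_0)\|_{2,n}/\kappa_c$.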
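
The truncated vector in the lemma keeps the $2m$ largest coefficients in absolute value and sets the rest to zero. A minimal sketch of that operation follows; the function name `keep_largest` and the rule for breaking ties (earlier index first) are assumptions of this illustration.

```python
def keep_largest(beta, k):
    """Keep the k entries of beta that are largest in absolute value; zero out the rest."""
    # Python's sort is stable, so ties are resolved in favour of the earlier index.
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def truncate_2m(beta, m):
    """The vector formed by the largest 2m components of beta, zero elsewhere."""
    return keep_largest(beta, 2 * m)
```

For instance, `truncate_2m([0.1, -3.0, 0.5, 2.0, -0.2], 1)` keeps only the entries `-3.0` and `2.0`.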
[Figure 1: plots of empirical rejection probabilities rp(0.05); recoverable panel title: "Proposed Method 1: rp(0.05)".]

Figure 1. The figure displays the empirical rejection probabilities of the nominal 5% level tests of a
true hypothesis based on different testing procedures: the top left plot is based on the standard post-model
selection procedure based on $\tilde\alpha$, the top right plot is based on the proposed post-model selection
procedure based on $\check\alpha$, and the bottom left plot is based on another proposed procedure based on the
statistic $L_n$. The results are based on 500 replications for each of the 100 combinations of $R^2$'s in the
primary and auxiliary equations in (3.14). Ideally we should observe the 5% rejection rate (of a true
null) uniformly across the parameter space (as in bottom right plot).

[Figure 2: recoverable panel titles: "Standard approach: Bias", "Proposed Method: Bias", "Standard approach: RMSE".]

Figure 2. The figure displays mean bias (top row), standard deviation (middle row), and root mean
square error (bottom row) for the proposed post-model selection estimator $\check\alpha$ (right column) and the
standard post-model selection estimator $\tilde\alpha$ (left column). The results are based on 500 replications for
each of the 100 combinations of $R^2$'s in the primary and auxiliary equations in (3.14).

Supplementary Appendix for "Uniform Post Selection Inference

for LAD Regression Models" 

Appendix D. Auxiliary Results for $\ell_1$-LAD and Heteroscedastic Lasso

In this section we state relevant theoretical results on the performance of the estimators $\ell_1$-LAD, Post-$\ell_1$-LAD,
heteroscedastic Lasso, and heteroscedastic Post-Lasso. These results were developed in [3] and
[2]. The main design condition relies on the restricted eigenvalue proposed in [9], namely for $\tilde x_i = (d_i, x_i')'$

$\kappa_{\bar c} = \inf_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1} \|\tilde x_i'\delta\|_{2,n}/\|\delta_T\|,$ (D.31)

where $\bar c = (c + 1)/(c - 1)$ for the slack constant $c > 1$, see [5]. It is well known that Condition SE implies
that $\kappa_{\bar c}$ is bounded away from zero if $\bar c$ is bounded for any subset $T \subset \{1, \ldots, p\}$ with $|T| \le s$.

D.1. $\ell_1$-Penalized LAD. For a data generating process such that $P(y_i \le \tilde x_i'\eta_0 \mid \tilde x_i) = 1/2$, independent
across $i$ ($i = 1, \ldots, n$), we consider the estimation of $\eta_0$ via the $\ell_1$-penalized LAD regression estimate

$\hat\eta \in \arg\min_\eta E_n[|y_i - \tilde x_i'\eta|] + \dfrac{\lambda}{n}\|\eta\|_1.$

As established in [3] and [26], under the event that

$\dfrac{\lambda}{n} \ge 2c\|E_n[(1/2 - 1\{y_i \le \tilde x_i'\eta_0\})\tilde x_i]\|_\infty,$ (D.32)

the estimator above achieves good theoretical guarantees under mild design conditions. Although $\eta_0$ is
unknown, we can set $\lambda$ so that the event in (D.32) holds with high probability. In particular, the pivotal
rule discussed in [3] proposes to set $\lambda = c'n\Lambda(1 - \gamma \mid \tilde x)$ for $c' > c$ and $\gamma \to 0$ where

$\Lambda(1 - \gamma \mid \tilde x) := (1 - \gamma)\text{-quantile of } 2\|E_n[(1/2 - 1\{U_i \le 1/2\})\tilde x_i]\|_\infty,$ (D.33)

and where $U_i$ are independent uniform random variables on $(0, 1)$, independent of $\tilde x_1, \ldots, \tilde x_n$. We suggest
$\gamma = 0.1/\log n$ and $c' = 1.1c$. This quantity can be easily approximated via simulations. Below we
summarize required regularity conditions.
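
The simulation approximation mentioned above can be sketched as follows. This is an illustrative Monte Carlo routine, not the authors' code: it draws independent uniforms, evaluates $2\|E_n[(1/2 - 1\{U_i \le 1/2\})\tilde x_i]\|_\infty$ for a fixed design given as a list of rows, and returns an empirical $(1 - \gamma)$-quantile. The function name, the number of replications and the quantile convention are assumptions.

```python
import math
import random


def simulate_pivotal_quantile(X, gamma, reps=500, seed=0):
    """Empirical (1 - gamma)-quantile of 2 * max_j |E_n[(1/2 - 1{U_i <= 1/2}) * X[i][j]]|."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    draws = []
    for _ in range(reps):
        sums = [0.0] * p
        for i in range(n):
            sign = 0.5 - (1.0 if rng.random() <= 0.5 else 0.0)  # takes values +/- 1/2
            for j in range(p):
                sums[j] += sign * X[i][j]
        draws.append(2.0 * max(abs(t) for t in sums) / n)
    draws.sort()
    k = min(reps - 1, max(0, math.ceil((1.0 - gamma) * reps) - 1))
    return draws[k]


def penalty_level(X, c_prime, gamma, reps=500, seed=0):
    """lambda = c' * n * Lambda(1 - gamma | X), following the rule stated above."""
    return c_prime * len(X) * simulate_pivotal_quantile(X, gamma, reps, seed)
```

Because the simulated quantity depends only on the design and not on the unknown coefficients, the routine can be run once before estimation.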

Condition PLAD. Assume that $\|\eta_0\|_0 = s \ge 1$, $E_n[\tilde x_{ij}^2] = 1$ for all $1 \le j \le p$, the conditional
density of $y_i$ given $d_i$, denoted by $f_i(\cdot)$, and its derivative are bounded by $\bar f$ and $\bar f'$, respectively, and
$f_i(\tilde x_i'\eta_0) \ge \underline{f} > 0$ is bounded away from zero uniformly in $n$.

Condition PLAD is implied by Condition I. The assumption on the conditional density is standard in
the quantile regression literature even with fixed $p$ or $p$ increasing slower than $n$ (see respectively [14]
and [8]). Next we present bounds on the prediction norm of the $\ell_1$-LAD estimator.

Lemma 4 (Estimation Error of £i-LAD). Under Condition PLAD, and using A = c'nA(l — 7 | x), we 
have with probability 1 — 27 — o(l) for n large enough 



Ix'iiv - Vo)h,n 



< bH^} ./slog(p/7) 



Tcrvf riri 
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//' :„f ll^^'^lla." 



provided < ^ + —. =g — == \ if inf ^ n-zu^i ' --■ 

Lemma [4] establishes the rate of convergence in the prediction norm for the ii-hKD estimator in 
a parametric setting. The extra growth condition required for identification is mild. For instance we 
typically have A < ^J\og{n\J p)/n and for many designs of interest we have inf^gAc ll^^i'^llln/^nll^i'^P] 
bounded away from zero (see [3]). For more general designs we have 

.^j \W^^\\2,n ^ .^j \\S:[5\\2.n ^ ^c 

.5gAc E„[|x^i5|3] ^ <5eAc ||(5||imaxi^„ ||xi||oo '' v^(l + c) maxj^„ ||xi||oo 
which implies the extra growth condition under K^s^ log(p V n) ^ (5„K^n. 

In order to alleviate the bias introduced by the $\ell_1$-penalty, we can consider the associated post-model
selection estimate associated with a selected support $\hat T$

$\tilde\eta \in \arg\min_\eta \{E_n[|y_i - \tilde x_i'\eta|] : \eta_j = 0 \text{ if } j \notin \hat T\}.$ (D.34)

The following result characterizes the performance of the estimator in (D.34), see [3] for the proof.

Lemma 5 (Estimation Error of Post-$\ell_1$-LAD). Assume the conditions of Lemma 4 hold, $\mathrm{support}(\hat\eta) \subset \hat T$,
and let $\hat s = |\hat T|$. Then we have for $n$ large enough



\\x'i{ri - r]o)\\2,n ;iP 



^ /(s + s)log(nVp) , A^i , 1 /slog(p/7) 



n(j)minis + s) UKc Kc 



■ J 1 ^\/</>min(s + s) ^\/'^min(s'+s) Iff' , c \\x'-6\\r. „ 

provided I ^ , "^ ' + V '^"■"^ ^ ' L IL. mf " ,'|J"i^ , ->p cx). 

Lemma 5 provides the rate of convergence in the prediction norm for the post model selection estimator
despite possible imperfect model selection. The rates rely on the overall quality of the selected model
(which is at least as good as the model selected by $\ell_1$-LAD) and the overall number of components $\hat s$.
Once again the extra growth condition required for identification is mild. For more general designs we
have

ml „ r, ,,/,o. i5 mt --— ,, _ ,, — ^ 



||<5||o^s+sE„[|i^J|3] ||5||o^s+s ||(5||imaxisc„ ||ii||oo \/s + sniaxi^„ ||ii||oo' 

Comment D.1. In Step 1 of Algorithm 2 we use $\ell_1$-LAD with $\tilde x_i = (d_i, x_i')'$, $\hat\delta := \hat\eta - \eta_0 = (\hat\alpha - \alpha_0, \hat\beta' - \beta_0')'$,
and we are interested in rates for $\|x_i'(\hat\beta - \beta_0)\|_{2,n}$ instead of $\|\tilde x_i'\hat\delta\|_{2,n}$. However, it follows that

$\|x_i'(\hat\beta - \beta_0)\|_{2,n} \le \|\tilde x_i'\hat\delta\|_{2,n} + |\hat\alpha - \alpha_0| \cdot \|d_i\|_{2,n}.$

Since $s \ge 1$, without loss of generality we can assume the component associated with the treatment
$d_i$ belongs to $T$ (at the cost of increasing the cardinality of $T$ by one which will not affect the rate of
convergence). Therefore we have that

$|\hat\alpha - \alpha_0| \le \|\hat\delta_T\| \le \|\tilde x_i'\hat\delta\|_{2,n}/\kappa_{\bar c}.$

In most applications of interest $\|d_i\|_{2,n}$ and $1/\kappa_{\bar c}$ are bounded from above with high probability. Similarly,
in Step 1 of Algorithm 1 we have that the Post-$\ell_1$-LAD estimator satisfies
D.2. Heteroscedastic Lasso. In this section we consider the equation (1.4) in the form

$d_i = x_i'\theta_0 + v_i, \qquad E[v_i] = 0,$ (D.35)

where we observe $\{(d_i, x_i')'\}_{i=1}^n$, $(x_i)_{i=1}^n$ are non-stochastic and normalized in such a way that $E_n[x_{ij}^2] = 1$
for all $1 \le j \le p$, and $(v_i)_{i=1}^n$ are independent across $i$ but not necessarily identically distributed. The
unknown support of $\theta_0$ is denoted by $T_d$ and it satisfies $|T_d| \le s$. To estimate $\theta_0$ and consequently $v_i$, we
compute

$\hat\theta \in \arg\min_\theta E_n[(d_i - x_i'\theta)^2] + \dfrac{\lambda}{n}\|\hat\Gamma\theta\|_1$ and set $\hat v_i = d_i - x_i'\hat\theta$, $i = 1, \ldots, n,$ (D.36)

where $\lambda$ and $\hat\Gamma$ are the associated penalty level and loadings which are potentially data-driven. In this
case the following regularization event plays an important role

$\dfrac{\lambda}{n} \ge 2c\|\hat\Gamma^{-1}E_n[x_i(d_i - x_i'\theta_0)]\|_\infty.$ (D.37)

As discussed in [9], [4] and [2], the event above implies that the estimator $\hat\theta$ satisfies $\|\hat\theta_{T_d^c}\|_1 \le \bar c\|\hat\theta_{T_d} - \theta_0\|_1$
where $\bar c = (c + 1)/(c - 1)$. Thus rates of convergence for $\hat\theta$ and $\hat v_i$ defined in (D.36) can be established
based on the restricted eigenvalue $\kappa_{\bar c}$ defined in (D.31) with $\tilde x_i = x_i$ and $T = T_d$.

The following are sufficient high-level conditions where again the sequences $\Delta_n$ and $\delta_n$ go to zero and
$C$ is a positive constant independent of $n$.

Condition HL. For the model (|D.35p . suppose that for s = s„ ^ 1 we have ||^o||o ^ s and 

(i) max (E[|xy«.|3])i/V(E[|xy«,|2])i/2 ^ c and $-i(l - ^/2p) ^ SnU^^^, 

(n) max |(E„ - E)[xf^v^]\ + max |(E„ - E)[xf,d'^^]\ < (5„, with probability 1 - A„. 

Condition HL is implied by Condition I and growth conditions (see Lemma 2). Several primitive
moment conditions imply the various cross moments bounds. These conditions also allow us to invoke
moderate deviation theorems for self-normalized sums from [12] to bound some important error components.
Despite heteroscedastic non-Gaussian noise, those results allow a sharp choice of penalty level
and loadings, which was analyzed in [2] and is summarized by the following lemma.

Valid options for setting the penalty level and the loadings for $j = 1, \ldots, p$, are

initial: $\hat\gamma_j = \sqrt{E_n[x_{ij}^2(d_i - \bar d)^2]}$, $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$,
refined: $\hat\gamma_j = \sqrt{E_n[x_{ij}^2\hat v_i^2]}$, $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$, (D.38)

where $c > 1$ is a constant, $\gamma \in (0, 1)$, $\bar d := E_n[d_i]$ and $\hat v_i$ is an estimate of $v_i$ based on Lasso with the
initial option (or iterations). [2] established that using either of the choices in (D.38) implies that the
regularization event (D.37) holds with high probability. Next we present results on the performance of
the estimators generated by Lasso.
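
The two choices above are simple to compute. The sketch below is an illustration only: the default values `c = 1.1` and `gamma = 0.05` are assumptions (the text requires only $c > 1$ and $\gamma \in (0, 1)$), the design is passed as a list of rows, and the function names are hypothetical. The standard normal quantile comes from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def lasso_penalty_level(n, p, c=1.1, gamma=0.05):
    """lambda = 2 * c * sqrt(n) * Phi^{-1}(1 - gamma / (2p))."""
    return 2.0 * c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))


def initial_loadings(X, d):
    """Initial option: gamma_j = sqrt(E_n[x_ij^2 * (d_i - mean(d))^2])."""
    n, p = len(d), len(X[0])
    d_bar = sum(d) / n
    return [
        math.sqrt(sum(X[i][j] ** 2 * (d[i] - d_bar) ** 2 for i in range(n)) / n)
        for j in range(p)
    ]


def refined_loadings(X, v_hat):
    """Refined option: gamma_j = sqrt(E_n[x_ij^2 * v_hat_i^2]), given residual estimates."""
    n, p = len(v_hat), len(X[0])
    return [
        math.sqrt(sum(X[i][j] ** 2 * v_hat[i] ** 2 for i in range(n)) / n)
        for j in range(p)
    ]
```

A typical use would compute the initial loadings, run Lasso once to obtain residuals, and then recompute the loadings with `refined_loadings`, possibly iterating.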

Lemma 6. Under Condition HL and setting A = 2c'y^$^^(l — ^/2p) for c' > c> 1, and using penalty 
loadings as in \D.38\) . there is an uniformly bounded c such that we have 

\\vi - ni.\\2.n = 112:^(6' - 6*0)112. n <p and \\vi - Ui,\\oa ^ \\9 - ^ollimaxllxilloo- 
Associated with Lasso we can define the Post-Lasso estimator as

$\tilde\theta \in \arg\min_\theta \{E_n[(d_i - x_i'\theta)^2] : \theta_j = 0 \text{ if } \hat\theta_j = 0\}$ and set $\tilde v_i = d_i - x_i'\tilde\theta$. (D.39)

That is, the Post-Lasso estimator is simply the least squares estimator applied to the covariates selected
by Lasso in (D.36). Sparsity properties of the Lasso estimator $\hat\theta$ under estimated weights follow similarly
to the standard Lasso analysis derived in [2]. By combining such sparsity properties and the rates in the
prediction norm we can establish rates for the post-model selection estimator under estimated weights.
The following result summarizes the properties of the Post-Lasso estimator.

Lemma 7 (Model Selection Properties of Lasso and Properties of Post-Lasso). Suppose that Conditions
HL and SE hold. Consider the Lasso estimator with penalty level and loadings specified as in Lemma 6.
Then the data-dependent model $\hat T_d$ selected by the Lasso estimator $\hat\theta$ satisfies with probability $1 - \Delta_n$:

$\|\hat\theta\|_0 = |\hat T_d| \lesssim s.$ (D.40)

Moreover, the Post-Lasso estimator obeys



V^h.n ^ \U(e - eo)h.n <P 



s log{p V n) 



Appendix E. Alternative Implementation via Double Selection 

An alternative proposal for the method is reminiscent of the double selection method proposed in [6] 
for partial linear models. This version replaces Step 3 with a LAD regression of y on d and all covariates 
selected in Steps 1 and 2 (i.e. the union of the selected sets). The method is described as follows: 

Algorithm 3. (A Double Selection Method)
Step 1 Run Post-$\ell_1$-LAD of $y_i$ on $d_i$ and $x_i$:

$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha,\beta} E_n[|y_i - d_i\alpha - x_i'\beta|] + \dfrac{\lambda}{n}\|(\alpha, \beta)\|_1.$

Step 2 Run Heteroscedastic Lasso of $d_i$ on $x_i$:

$\hat\theta \in \arg\min_\theta E_n[(d_i - x_i'\theta)^2] + \dfrac{\lambda}{n}\|\hat\Gamma\theta\|_1.$

Step 3 Run LAD regression of $y_i$ on $d_i$ and the covariates selected in Steps 1 and 2:

$(\check\alpha, \check\beta) \in \arg\min_{\alpha,\beta} \{E_n[|y_i - d_i\alpha - x_i'\beta|] : \mathrm{support}(\beta) \subset \mathrm{support}(\hat\beta) \cup \mathrm{support}(\hat\theta)\}.$

The double selection algorithm has three steps: (1) select covariates based on the standard $\ell_1$-LAD
regression, (2) select covariates based on heteroscedastic Lasso of the treatment equation, and (3) run a
LAD regression with the treatment and all selected covariates.

This approach can also be analyzed through Lemma 1 since it creates instruments implicitly. To see
that let $\hat T^*$ denote the variables selected in Steps 1 and 2: $\hat T^* = \mathrm{support}(\hat\beta) \cup \mathrm{support}(\hat\theta)$. By the first
order conditions for $(\check\alpha, \check\beta)$ we have

$\|E_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)(d_i, x_{i\hat T^*}')']\| = O\{(\max_{1 \le i \le n}|d_i| + K_x|\hat T^*|^{1/2})(1 + |\hat T^*|)/n\},$
which creates an orthogonal relation to any linear combination of $(d_i, x_{i\hat T^*}')'$. In particular, by taking the
linear combination $(d_i, x_{i\hat T^*}')(1, -\tilde\theta_{\hat T^*}')' = d_i - x_{i\hat T^*}'\tilde\theta_{\hat T^*} = d_i - x_i'\tilde\theta = \hat v_i$, which is the instrument in Step 2
of Algorithm 1, we have

$E_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\hat v_i] = O\{\|(1, -\tilde\theta')'\|(\max_{1 \le i \le n}|d_i| + K_x|\hat T^*|^{1/2})(1 + |\hat T^*|)/n\}.$

As soon as the right side is $o_P(n^{-1/2})$, the double selection estimator $\check\alpha$ approximately minimizes

1 r \^ \^n[f{yi - dia - x'^i3)z.i\\^ 

where $\hat v_i$ is the instrument created by Step 2 of Algorithm 1. Thus the double selection estimator can be
seen as an iterated version of the method based on instruments where the Step 1 estimate $\hat\beta$ is updated
with $\check\beta$.

Appendix F. Proof of Theorem 2

Proof of Theorem 2. We will verify Condition ILAD and the desired result then follows from Lemma 1. The
assumptions on the error density $f_\epsilon(\cdot)$ in Condition ILAD(i) are assumed in Condition I(iv). The moment
conditions on $d_i$ and $v_i$ in Condition ILAD(i) are assumed in Condition I(ii).

Condition SE implies that $\kappa_{\bar c}$ is bounded away from zero with probability $1 - \Delta_n$ for $n$ sufficiently
large, see [9]. Step 1 relies on $\ell_1$-LAD. Condition PLAD is implied by Condition I. By Lemma 4 and
Comment D.1 we have



||x-(^-/?o)||2.n <P ^s\og{nyp)/n and |S - ao| <P ^/ s log(p V n)/n < o(l) log ^n 

because s^ log {n\J p) ^ ^„n and the required side condition holds. Indeed, without loss of generality 
assume that T contains the treatment so that for Xi ~ {di,x'^y, 6 = {6d,S'^y, because of Condition SE 
and the fact that ]E„[|dip] <p E[\di\^] = 0(1), we have 

^ ■ f I|5-'5|l2 JI<5tI|kc 

^ int^gAc 



^ . r mSWlJlSrWi^^J^ 

=^ ™*eAe 8A:,(l+c)||5T||l||5^5|li,„+8K.(l+c)||5T||l|5d|2{||di||^^„+E„[|di|3]} 

-^ Kc/Vs ^ 1 

•^ "" '- '" ' ' -" ■- ■■■•"■ -"' --P yiifx ■ 



(F.41) 



Therefore, since A < -y/n log(p V n) we have 

A^/i+ Vs^nog(^7V^'56Ae E„[|i^5|3] -^-^ X^slog(pVn) ^-^ °° 

under K^s^ log (p V n) ^ (5„n. Note that the rate for a and the definition of A implies {a : |a — aoj ^ 
n^^'"^ \ogn] C A (with probability 1 — o(l)) which is required in ILAD(ii). Moreover, by the (shrinking) 
definition of A we have the initial rate of ILAD(iv). Step 2 relies on Lasso. Condition HL is implied by 
Condition I and Lemma [5] applied twice with Q = v.i and C,i = di under the condition that K"^ logp ^ (5„n. 
By Lemma[6]we have ||a:^(^ — 6'o)||2,n ^p \/ s\og{n \J p)/n. Moreover, by Lemma[7]we have H^Ho < s with 
probability 1 — o(l). 
The rates established above for $\hat\theta$ and $\hat\beta$ imply (A.25) in ILAD(iii) since by Condition I(ii) $E[|v_i|] \le (E[v_i^2])^{1/2} = O(1)$
and $\max_{1 \le i \le n}|x_i'(\hat\theta - \theta_0)| \lesssim_P K_x\sqrt{s^2\log(p \vee n)/n} = o(1)$.

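To verify Condition ILAD(iii) (A.26), arguing as in the proof of Theorem 1, we can deduce that

$\sup_{\alpha \in \hat{\mathcal{A}}} |\mathbb{G}_n(\psi_{\alpha,\hat\beta,\hat\theta} - \psi_{\alpha,\beta_0,\theta_0})| = o_P(1).$

This completes the proof. $\square$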

Appendix G. Proof of Auxiliary Technical Results 

Proof of Lemma\^ We shall use Lemma [5] ahead. Let Zi — {xiXi) and define F — {fj{xi,Ci) = ^fjCf ■ 
j = 1, . . . ,p}. Since P(|A:| > t) < E[\X\'']/t'', for fc = 2 we have that median(|X|) ^ ^2E[|X|2] and for 
k = g/4 we have (1 - T)-quantilc of \X\ is bounded by (E[\X\'^/^]/t)^/''. Then we have 



maxmedian(|G„(/(a;„CO)l) «= V2E[4Cf] ^ K^2E[Cf\ 



and 



(1 - T)-quantile of max ^/E„[a;4^-C4] <; (^ _ T)-quantile of K^JE„[Cf] < K^{E[\Q\i]/t)^/'^. 



The conclusion follows from Lemma ID D 

Proof of Lemma 3. By the triangle inequality we have

Now let $T^1$ denote the $m$ largest components of $\beta$ (in absolute value) and let $T^k$ correspond to the $m$ largest components of $\beta$
outside $\cup_{d=1}^{k-1}T^d$. It follows that $\beta^{(2m)} = \beta_{T^1\cup T^2}$.

Next note that for $k \ge 3$ we have $\|\beta_{T^{k+1}}\| \le \|\beta_{T^k}\|_1/\sqrt{m}$. Indeed, consider the problem $\max\{\|v\|/\|u\|_1 :
v,u \in \mathbb{R}^m, \max_i|v_i| \le \min_i|u_i|\}$. Given $v$ and $u$, we can always weakly increase the objective function by
using $\tilde v = \max_i|v_i|(1,\dots,1)'$ and $\tilde u = \min_i|u_i|(1,\dots,1)'$ instead. Thus, the maximum is achieved at
$v^* = u^* = (1,\dots,1)'$, yielding $1/\sqrt{m}$.
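The block inequality above can be checked numerically. The sketch below is only an illustration under assumed inputs: the random vector `beta`, the sizes `p` and `m`, and the helper `shelling_blocks` are hypothetical and do not come from the paper. It sorts the coordinates by decreasing absolute value, cuts them into consecutive blocks of size `m`, and compares the Euclidean norm of each block with the $\ell_1$ norm of the preceding block divided by $\sqrt{m}$.

```python
import numpy as np

def shelling_blocks(beta, m):
    """Index blocks of size m, ordered by decreasing |beta_j| (T^1, T^2, ...)."""
    order = np.argsort(-np.abs(beta), kind="stable")
    return [order[start:start + m] for start in range(0, len(order), m)]

rng = np.random.default_rng(1)
p, m = 40, 5                      # hypothetical dimensions
beta = rng.standard_normal(p)     # hypothetical coefficient vector
blocks = shelling_blocks(beta, m)

# gaps[k] = ||beta_{T^k}||_1 / sqrt(m) - ||beta_{T^{k+1}}||_2, which should be >= 0.
gaps = []
for current, following in zip(blocks[:-1], blocks[1:]):
    lhs = np.linalg.norm(beta[following])
    rhs = np.abs(beta[current]).sum() / np.sqrt(m)
    gaps.append(rhs - lhs)

# Extremal case from the proof: v = u = (1, ..., 1)' attains the ratio 1 / sqrt(m).
ones = np.ones(m)
extremal_ratio = np.linalg.norm(ones) / np.abs(ones).sum()

print("smallest gap:", min(gaps))
print("extremal ratio:", extremal_ratio, "vs 1/sqrt(m):", 1 / np.sqrt(m))
```

Every entry of a later block is no larger in absolute value than every entry of the block before it, which is exactly the constraint $\max_i|v_i| \le \min_i|u_i|$ in the optimization problem.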

Thus, by $\|\beta_{T^c}\|_1 \le c\|\beta_T\|_1$ and $|T| = s$ we have

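$$
\sum_{k\ge 3}\|x_i'\beta_{T^k}\|_{2,n} \le \sqrt{\phi_{\max}(m)}\sum_{k\ge 3}\|\beta_{T^k}\| \le \sqrt{\phi_{\max}(m)}\sum_{k\ge 2}\frac{\|\beta_{T^k}\|_1}{\sqrt{m}} \le \sqrt{\phi_{\max}(m)}\,\frac{\|\beta_{(T^1)^c}\|_1}{\sqrt{m}} \le \sqrt{\phi_{\max}(m)}\,c\,\sqrt{\frac{s}{m}}\,\|\beta_T\|.
$$

□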
Appendix H. Auxiliary Probabilistic Inequalities

Let $Z_1,\dots,Z_n$ be independent random variables taking values in a measurable space $(S,\mathcal S)$, and
consider an empirical process $G_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$ indexed by a pointwise measurable
class of functions $\mathcal F$ on $S$ (see [25], Chapter 2.3). Denote by $P_n$ the (random) empirical probability
measure that assigns probability $n^{-1}$ to each $Z_i$. Let $N(\epsilon,\mathcal F,\|\cdot\|_{P_n,2})$ denote the $\epsilon$-covering number of
$\mathcal F$ with respect to the $L^2(P_n)$ seminorm $\|\cdot\|_{P_n,2}$.

The following maximal inequality is derived in [3]. 

Lemma 8 (Maximal inequality for finite classes). Suppose that the class $\mathcal F$ is finite. Then for every
$\tau \in (0,1/2)$ and $\delta \in (0,1)$, with probability at least $1 - 4\tau - 4\delta$,

$$
\max_{f\in\mathcal F}|G_n(f)| \le \Big\{4\sqrt{2\log(2|\mathcal F|/\delta)}\;Q(1-\tau)\Big\} \vee 2\max_{f\in\mathcal F}\mathrm{median}(|G_n(f)|),
$$

where $Q(u) := u$-quantile of $\max_{f\in\mathcal F}\sqrt{E_n[f(Z_i)^2]}$.
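To get a feel for the quantities in Lemma 8, the following Monte Carlo sketch evaluates both sides of the inequality for a toy finite class. Everything specific here is an assumption made for illustration: the class is taken to be the coordinate functions $f_j(z) = z_j$ with $Z_i \sim N(0, I_p)$, and $Q(1-\tau)$ and the medians are replaced by simulation estimates rather than population values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 20, 500          # hypothetical sample size, class size, replications
tau, delta = 0.05, 0.05

# Toy class: f_j(z) = z_j, j = 1..p, with Z_i ~ N(0, I_p), so E[f_j(Z_i)] = 0.
z = rng.standard_normal((reps, n, p))
g_n = np.sqrt(n) * z.mean(axis=1)                      # G_n(f_j), shape (reps, p)
max_abs_g = np.abs(g_n).max(axis=1)                    # max_f |G_n(f)| per replication
max_rms = np.sqrt((z ** 2).mean(axis=1)).max(axis=1)   # max_f sqrt(E_n[f(Z_i)^2])

# Simulation estimates of Q(1 - tau) and of max_f median(|G_n(f)|).
q_hat = np.quantile(max_rms, 1 - tau)
max_median_hat = np.median(np.abs(g_n), axis=0).max()

bound = max(4 * np.sqrt(2 * np.log(2 * p / delta)) * q_hat, 2 * max_median_hat)
coverage = float(np.mean(max_abs_g <= bound))

print("bound:", bound)
print("share of replications within the bound:", coverage)
print("guaranteed level:", 1 - 4 * tau - 4 * delta)
```

In this Gaussian toy case the bound is far from tight, so the observed share is expected to sit well above the guaranteed level $1 - 4\tau - 4\delta$; the sketch illustrates the statement and is not a verification of it.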

The following maximal inequality is derived in JS]. 

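Lemma 9 (Maximal inequality for infinite classes). Let $F = \sup_{f\in\mathcal F}|f|$, and suppose that there exist
some constants $\omega_n > 1$, $\upsilon > 1$, $m > 0$, and $h_n \ge h_0$ such that

$$
N(\epsilon\|F\|_{P_n,2},\mathcal F,\|\cdot\|_{P_n,2}) \le (n\vee h_n)^m(\omega_n/\epsilon)^{\upsilon m}, \quad 0 < \epsilon < 1.
$$

Set $C := (1 + \sqrt{2\upsilon})/4$. Then for every $\delta \in (0,1/6)$ and every constant $K \ge \sqrt{2/\delta}$, we have

$$
\sup_{f\in\mathcal F}|G_n(f)| \le 4\sqrt{2c}\,KC\sqrt{m\log(n\vee h_n\vee\omega_n)}\,\max\Big\{\sup_{f\in\mathcal F}\sqrt{E[f(Z_i)^2]},\ \sup_{f\in\mathcal F}\sqrt{E_n[f(Z_i)^2]}\Big\},
$$

with probability at least $1 - \delta$, provided that $n\vee h_0 \ge 3$; the constant $c < 30$ is universal.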